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Background of the Invention 

1. Field of the Invention 

The present invention relates to the field of digital data processor design, 
5 specifically to the control and operation of the instruction pipeline of the processor and 
structures associated therewith. 

2. Description of Related Technology 

10 RISC (or reduced instruction set computer) processors are well known in the 

computing arts. RISC processors generally have the fundamental characteristic of 
utilizing a substantially reduced instruction set as compared to non-RISC (commonly 
known as "CISC") processors. Typically, RISC processor machine instructions are not 
all micro-coded, but rather may be executed immediately without decoding, thereby 

15 affording significant economies in terms of processing speed. This "streamlined" 
instruction handling capability furthermore allows greater simplicity in the design of the 
processor (as compared to non-RISC devices), thereby allowing smaller silicon and 
reduced cost of fabrication. 

RISC processors are also typically characterized by (i) load/store memory 

20 architecture (i.e., only the load and store instructions have access to memory; other 
instructions operate via internal registers within the processor); (ii) unity of processor and 
compiler; and (iii) pipelining. 

Despite their many advantages, RISC processors may be prone to significant 
delays or stalls within their pipelines. These delays stem from a variety of causes, 

25 including the design and operation of the instruction set of the processor (e.g., the use of 
multi-word and/or "breakpoint" instructions within the processor's instruction set), the 
use of non-optimized bypass logic for operand routing during the execution of certain 
types of instructions, and the non-optimized integration (or lack of integration) of the data 
cache within the pipeline. Furthermore, lack of parallelism in the operation of the 

30 pipeline can result in critical path delays which reduce performance. These aspects are 
described below in greater detail. 
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Multi-word Instructions 

Many RISC processors offer programmers the opportunity to use instructions that 
span multiple words. Some multi-word instructions permit a greater number of operands 

5 and addressing modes while others enable a wider range of immediate data values. For 
multi-word immediate data, the pipelined execution of instructions has some inherent 
limitations including, inter alia, the potential for an instruction containing long immediate 
data to be impacted by a pipeline stall before the long immediate data has been 
completely fetched from memory. This stalling of an incompletely fetched piece of data 

10 has several ramifications, one of which is that the otherwise executable instruction may 
be stalled before it is necessary to do so. This leads to increased execution time and 
overhead within the processor. Stalling of the processor due to unavailabiliy of data 
causes the processor to insert one or more additional clock cycles. During these clock 
cycles the processor can not advance additional instruction execution as a general rule. 

15 This is because the incomplete data can be considered to be a blocking function. This 
blocking action is to cause execution to remain pending until the data becomes available. 
For example, consider a simple add instruction that adds two quanities and places the 
result in a third location. Providing that both pieces of data are available when needed, 
the execution completes in the normal number of cycles. Now consider the case in which 

20 one of the pieces of data is not available. In this case completion of the add instruction 
must stop until the data becomes available. The consequence of this stalling action is to 
possibly delay the completion by more than the minimum necessary time. 

Breakpoint Instructions 

25 One of the useful RISC instructions is the "breakpoint" instruction. Chiefly for 

use during the design and implementation phases of the processor (e.g., 
software/hardware integration and software debug), the breakpoint instruction causes the 
CPU to stop execution of any further instructions without some type of direct 
intervention, typically at the request of an operator. Once the breakpoint instruction has 

30 been executed by the pipeline, the CPU stops further processing until it receives some 
external signal such as an interrupt which signals to the CPU that execution should 
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resume. Breakpoint instructions typically replace or displace some other executable 
instruction which is subsequently executed upon resumption of the normal execution 
state of the CPU. 

Execution time is critical for many applications, hence minimizing so-called 

5 critical paths in the decode phase of a multi-stage pipelined CPU is an important 
consideration. Since the breakpoint instruction is a performance critical instruction during 
normal execution, the prior art practice has been to perform the breakpoint instruction 
decode and execution in the first pipeline stage of the typical four-stage pipeline (i.e., 
fetch, decode, execution, and write-back stages). Fig. 1 illustrates a typical prior art 

10 breakpoint instruction decode architecture. As shown in Fig. 1, the prior art stage 1 
configuration 100 comprises the stage 1 latch 102, instruction cache 104, instruction 
decode logic 106, instruction request address selection logic 108, the latter providing 
input to the stage 2 latches 110. The current program counter (pc) address value is input 
112 back to the stage 1 latch 102 for subsequent instruction fetch. Instruction decode, 

15 including decode of any breakpoint instructions, occurs within the instruction decode 
logic 106. However, such decoding in the first stage places unnecessary demands on the 
speed path of ordinary instruction handling. Ordinary instructions are decoded in stage 2 
(decode) of the pipeline. This stage one decode of the breakpoint instruction places 
minimum decode requirements on the first stage that are longer than would otherwise be 

20 required without having breakpoint instruction decode occur in the first stage. This result 
is due largely to the fact that the breakpoint instruction requires time to setup and disable 
a variety of functional blocks. For example, in the ARC ™ extensible RISC processor 
architecture manufactured by the Assignee hereof, functional blocks may include 
optional mulitply-accumlate hardware, Viterbi acceleration units, and other specific 

25 hardware accelerators in addition to standard functional blocks such as an arithmetic- 
logic unit, address generator units, interrupt processors and peripheral devices. Setup for 
each of these units will depend on the exact nature of the unit. For example, a single 
cycle unit for which state information is not required for the unit to function, may require 
no specialized set up. By contrast, an operation that requires mulitple pipeline stages to 

30 complete will require assertion of signals within the pipeline to ensure that and transitory 
results are safely stored in appropriate registers. Where as other instructions are simply 
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fetched in stage 1, the breakpoint instruction requires control signals to be generated to 
most elements of the core. This results in longer netlists and hence greater delays. 

Bypass logic 

Bypass logic is sometimes used in RISC processors (such as the aforementioned 
ARC core) to provide additional flexibility for routing operands to a variety of input 
options. For example, as illustrated in Fig. 2, outputs of various functional units (such as 
the first and second execute stage result selection logic) are routed back to the input of 
another functional unit; e.g., decode stage bypass operand selection logic. This bypass 
arrangement eliminates a number of load and store operations, reduces the number of 
temporary variable locations needed during execution, and stages data in the proper 
location for iterative operations. Such bypass arrangements permit software to exploit the 
nature of pipelined instruction execution. Using the prior art bypass circuitry of Fig. 2, a 
program can be configured to perform pipelined iterative algorithms. One such algorithm 
is the sum-of-products for a finite series. Since the processor performs scalar operations, 
each stage of the summation is achieved by a single multiply followed by a single 
addition of the result to a sum. This principal is illustrated by the following operation: 

Sum=0 

For 1=1 to n do 
Sum=sum+(a(I)*b(I)); 

In a commonly used prior art CPU scheme, the value of Sum is stored in a dedicated 
general purpose register or in a memory location. Each iteration requires a memory fetch 
or register access operation to calculate the next summation in the series. Since the CPU 
can only perform a limited number of memory or register accesses per cycle, this form 
may execute relatively slowly in comparison to a single cycle ideal for the sum-of- 
products operation (i.e., where the sum-of-products is calculated entirely within a single 
instruction cycle), or even in comparison to a non-single cycle operation where memory 
fetches or register accesses are not required in each iteration of the operation. 



Data Cache Integration 

For a number of instruction types within the instruction set of the typical RISC 
processor, there in no requirement for or need to stall the pipeline. However, some other 
instruction types will require a stall to occur. The ordinary prior art method for 
5 integrating a data cache with a processor core relies on a technique that assumes that the 
worst case evaluation for stalls must be applied to even those cases where the most 
extreme case specifically does not apply. This "worst case" approach leads to an 
increased number of pipeline stalls (and/or increased duration for each stall) as well as 
increased overhead, thereby resulting ultimately in increased execution time and reduced 
10 pipeline speed. 

Fig. 3 is a logical block diagram illustrating typical prior art data cache 
integration. It assumes the cache request originates directly from the pipeline rather than 
the load store queue. Note the presence of the bypass operand selection logic 302, the 
control logic hazard detection logic 304, and the multi-level latch control logic 306 
1 5 structures within the second (E2) execution stage . 

Fig. 3 a illustrates the operation of the typical prior art data cache structure of Fig. 
3 in the context of an exemplary load (Ld), move (Mov), and add (Add) instruction 
sequence. The exemplary instruction sequence is as follows: 

20 Ldr0,[rl,4] 

Mov r5,r4 independent of the load 
Add r8,r0,r9 ;dependent on first load 



First, in step 350, the Load (Ld) is requested. The Mov is then requested in step 
25 352. In step 354, the Add is requested. The Ld operation begins in step 356. Next, the 
Mov operation begins in step 358. The cache misses. Accordingly, the Add is then 
prevented from moving. 

In step 360, the Mov continues to flow down the pipeline. In step 362, the Add 
moves down the pipeline in response to the Load operation completing. The pipeline then 
30 flows with no stalls (steps 364, 366, and 368). 

Note that in the foregoing example, the Add instruction is prevented from moving 
from the decode stage of the pipeline to the first execute stage (El) for several cycles. 
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This negatively impacts pipeline performance by slowing the execution of the Add 
instruction. 

Pipeline Parallelism 

5 Often in prior art processor systems, the instruction cache pipeline integration is 

far from optimal. This results in many cases from the core effectively making the cache 
pipeline stages 0 and 1 dependent on each other. This can be seen diagrammatically in 
Fig. 4, wherein the pipeline control 402, instruction decode 404, nextpc selection 406, 
and instruction cache address selection 408, are disposed in the instruction fetch stage 

10 412 of the pipeline. The critical path of this non-optimized pipeline 400 allows the 
control path of the processor to be influenced by a slow signal/data path. Accordingly the 
slow data path must be removed if the performance of the core is to be improved. For 
example, in most core build instances, the prior art approach means the instruction fetch 
pipeline stage has an unequal duration to the other pipeline stages, and in general 

15 becomes the limiting factor in processor performance since it limits the minimum clock 
period. 

Fig. 4a is a block diagram of components and instruction flow within the non- 
optimized processor design of Fig. 4. As illustrated in Fig. 4a ; the slow signal/data path 
influences the control path for the pipeline 400. 

20 Based on the foregoing, there is a need for an improved methods and apparatus 

for enhancing pipeline operation, including reducing stalls and delays in CPU execution. 
Ideally, several aspects of pipeline operation would be optimized by such improved 
method and apparatus, including (i) handling of multi-word instructions and immediate 
data, such as in the calculation of such scalar quantities with a reduced number of 

25 memory fetches or register accesses; (ii) use of breakpoint instructions; (iii) bypass logic 
arrangement, (iv) data cache operation/integration, and (v) increased parallelism within 
the pipeline. Additionally, such improved apparatus and method would be readily 
adapted to existing processor designs and architectures, thereby minimizing the work 
necessary to integrate such functionality, as well as the impact on the processor design as 

30 a whole. 
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Summary of the Invention 

The foregoing needs are satisfied by providing an improved method and apparatus 
for enhanced performance in a pipelined processor. 

In a first aspect of the invention, a method and apparatus for avoiding the stalling of 
long immediate data instructions, so that processor performance is maximized, is disclosed. 
The invention results in not enabling the host to halt the core before an instruction with long 
immediate values in the decode stage of the pipeline has merged, thereby advantageously 
making the instructions containing long immediate data "non-stallable" on the boundary 
between the instruction opcode and the immediate data. Consequently the instruction 
containing long immediate data is treated as if the CPU was wider in word width for that 
instruction only. The method generally comprises providing a first instruction word; 
providing a second instruction word; and defiling a single large instruction word 
comprising the first and second instruction words; wherein the single large instruction word 
is processed as a single instruction within the processor's pipeline, thereby reducing pipeline 
delays. 

In a second aspect of the invention, an improved apparatus for decoding and 
executing breakpoint instructions, so that processor pipeline performance is maximized, is 
disclosed. In one exemplary embodiment, the apparatus comprises a pipeline arrangement 
with instruction decode logic operatively located within the second stage (e.g., decode stage) 
of the pipeline, thereby facilitating breakpoint instruction decode in the second stage versus 
the first stage as in prior art systems. Such decode in the second stage removes several 
critical "blockages" within the pipeline, and enhances execution speed by increasing 
parallelism therein. 

In a third aspect of the invention, an improved method for decoding and executing 
breakpoint instructions, so that processor pipeline performance is maximized, is disclosed. 
Generally, the method comprises providing a pipeline having at least first, second, and third 
stages; providing a breakpoint instruction word, the breakpoint instruction word resulting in 
a stall of the pipeline when executed; inserting the breakpoint instruction word into the first 
stage of the pipeline; and delaying decode of the breakpoint instruction word until the 
second stage of the pipeline. In one exemplary embodiment, the pipeline is a four stage 



pipeline having fetch, decode, execution, and write-back stages, and decode of the 
breakpoint instruction is delayed until the decode stage of the processor. Additionally, to 
support the decoding the breakpoint instruction in the decode stage, the method further 
comprises changing the program counter (pc) from the current value to a breakpoint pc 
value. 

In a fourth aspect of the invention, an improved method of debugging a processor 
design is disclosed. The method generally comprises providing a processor hardware 
design having a multi-stage pipeline; providing an instruction set including at least one 
breakpoint instruction adapted for use with the processor hardware design; running at 
least a portion of the instruction set (including the breakpoint instruction) on the 
processor design during debug; decoding the at least one breakpoint instruction at the 
second stage of the pipeline; changing the program counter (pc) from the current value to 
a breakpoint pc value; executing the breakpoint instruction on order to halt processor 
operation; and debugging the instruction set or hardware/instruction set integration while 
the processor is halted. 

In a fifth aspect of the invention, an apparatus for bypassing various components 
and registers within a processor so as to maximize pipeline performance is disclosed. In 
one embodiment, the apparatus comprises an improved logical arrangement employing a 
special multi-function register having a selectable "bypass mode"; when in bypass mode, 
the multi-function register is used to retain the result of a multi-cycle scalar operation (e.g., 
summation in a sum-of-products calculation), and present this result as a value to be selected 
from by a subsequent instruction. In this fashion, memory accesses to obtain such 
summation are substantially obviated, and the pipeline accordingly operates at a higher 
speed due to elimination of the delays associated with the obviated memory accesses. 

In a sixth aspect of the invention, a method for bypassing various components and 
registers within a processor so as to maximize processor performance is disclosed. In one 
embodiment, the method comprises providing a multi-function register; defining a bypass 
mode for the register, wherein the register maintains the result of a multi-cycle scalar 
operation therein during such bypass mode; performing a scalar operation a first time; 
storing the result of the operation in the register in bypass mode; obtaining the result of 



the first operation directly from the register, and performing a scalar operation a second 
time using the result of the first operation obtained from the register. 

In an seventh aspect of the invention, improved methods for increasing pipeline 
performance and efficiency by decoupling certain signals, and allowing an existing pipeline 

5 configuration to reveal more parallelism, are disclosed. The dataword fetch (e.g., ifetch) 
signal, which indicates the need to fetch instruction opcode/data from memory at the 
location being clocked into the program counter (pc) at the end of the current cycle, is made 
independent of the qualifying (validity) signal (e.g., ivalid). Additionally, the next program 
counter value signal (e.g., next_pc) is made independent of the data word supplied by the 

10 memory controller (e.g., pliw) and ivalid. The hazard detection logic and control logic of 
the pipeline is further made independent of ivalid; i.e., the stage 1, stage 2, and stage 3 
enables (enl, en2, en3) are decoupled from the ivalid (and pliw) signals, thereby decoupling 
pipeline movement. So-called "structural stalls" are further utilized when a slow functional 
unit, or operand fetch in the case of the xy memory extension, generates the next program 

15 counter signal (next_pc). The jump instruction of the processor instruction set is also moved 
from stage 2 to 3, independent of ivalid. In this case, the jump address is held if the delay 
slot misses the cache and link. Additionally, delay slot instructions are not separated from 
their associated jump instruction. 

In an eighth aspect of the invention, an improved data cache apparatus useful within 

20 a pipelined processor is disclosed. The apparatus generally comprises logic which allows the 
pipeline to advance one stage ahead of the cache. Furthermore, rather than assuming that the 
pipeline will need to be stalled under all circumstances as in prior art pipeline control logic, 
the apparatus of the present allows the pipeline to move ahead of the cache, and only stalls 
when a required data word is not provided (or other such condition necessitating a stall). 

25 Such conditional "latent" stalls enhance pipeline performance over the prior art 
configurations by eliminating conditions where stalls are unnecessarily invoked. In one 
exemplary embodiment, the pipelined processor comprises an extensible RISC-based 
processor, and the logic comprises (i) bypass operand selection logic disposed in the 
execution stage of the pipeline, and (ii) a multi-function register architecture. 

30 In a ninth aspect of the invention, an improved method of reducing pipeline delays 

due to stalling using "latent" stalls is disclosed. The method generally comprises providing 
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a processor having an instruction set and multistage pipeline; adapting the processor pipeline 
to move at least one stage ahead of the data cache, thereby assuming a data cache hit; 
detecting the presence of at least one required data word; and stalling the pipeline only when 
the required data word is not present. 
5 In a tenth aspect of the invention, an improved processor architecture utilizing one 

or more of the foregoing improvements including "atomic" instruction words, 
improved bypass logic, delayed breakpoint instruction decode, improved data cache 
architecture, and pipeline "decoupling" enhancements, is disclosed. In one exemplary 
embodiment, the processor comprises a reduced instruction set computer (RISC) having a 

10 four stage pipeline comprising instruction fetch, decode, execute, and writeback stages, 
and "latent stall" data cache architecture which allows the pipeline to advance one stage 
ahead of the cache. In another embodiment, the processor further includes an instruction 
set comprising at least one breakpoint instruction, the decoding of the breakpoint 
instruction being accomplished within stage 2 of the pipeline. The processor is also 

15 optionally configured with a multi-function register in a bypass configuration such that 
the result of one iteration of an iterative calculation is provided directly as an operand for 
subsequent iterations. 

Brief Description of the Drawings 
Fig. 1 is functional block diagram of a prior art pipelined processor breakpoint 
20 instruction decode architecture (stage 1) illustrating the relationship between the 
instruction cache, instruction decode logic, and instruction request address selection 
logic. 

Fig. 2 is block diagram of a prior art processor bypass logic architecture 
illustrating the relationship of the bypass logic to the single- and multi-cycle functional 
25 units and registers. 

Fig. 3 is functional block diagram of a prior art pipelined processor data cache 
architecture illustrating the relationship between the data cache and associated execution 
stage logic. 

Fig. 3a is graphical representation of pipeline movement within a typical prior art 
3 0 processor pipeline architecture. 
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Fig. 4 is block diagram illustrating a typical non-optimized prior art processor 
pipeline architecture and the relationship between various instructions and functional 
entities within the pipeline logic. 

Fig. 4a is a block diagram of components and instruction flow within the non- 
optimized prior art processor design of Fig. 4. 

Fig. 5 is logical flow diagram illustrating one embodiment of the long instruction 
word long immediate (limm) merge logic of the invention. 

Fig. 6 is a block diagram of one embodiment of the modified pipeline architecture 
and related functionalities according to the present invention, illustrating the enhanced 
path independence and parallelism thereof. 

Fig. 7 is a functional block diagram of one exemplary embodiment of the pipeline 
logic arrangement of the invention, illustrating the decoupling of the ivalid and pliw 
signals from the various other components of the pipeline logic. 

Fig. 8 is functional block diagram of one embodiment of the breakpoint 
instruction decode architecture (stage 1) of the present invention, illustrating the 
relationship between the instruction cache, instruction decode logic, and instruction 
request address selection logic. 

Fig. 8a is a graphical representation of the movement of the pipeline of an 
exemplary processor incorporating the improved breakpoint instruction logic of the 
invention, wherein a breakpoint instruction located with in a delay slot. 

Fig. 8b is a graphical representation of pipeline movement wherein a breakpoint 
instruction normally handled within the pipeline when a delay slot is not present. 

Fig. 8c is a graphical representation of pipeline movement during stalled jump 
and branch operation according to the present invention. 

Fig. 9 is block diagram of one embodiment of the improved bypass logic 
architecture of the present invention, illustrating the use of a multi-function register 
within the execute stage of the pipeline logic between the bypass operand selection logic 
and the single- and multi-cycle functional units. 

Fig. 10 is a logical flow diagram illustrating one embodiment of the method of 
utilizing bypass logic to maximize processor performance during iterative calculations 
(such as sum-of products) according to the invention. 
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Fig. 1 1 is a block diagram illustrating one exemplary embodiment of the modified 
data cache structure of the present invention. 

Fig. 11a is a graphical representation of pipeline movement in an exemplary 
processor incorporating the improved data cache integration according to the present 
invention. 

Fig. 12 is logical flow diagram illustrating the one exemplary embodiment of the 
method of enhancing the performance of a pipelined processor design according to the 
invention. 

Fig. 13 is a logical flow diagram illustrating the generalized methodology of 
synthesizing processor logic using a hardware description language (HDL), the 
synthesized logic incorporating the pipeline performance enhancements of the present 
invention. 

Fig. 14 is a block diagram of an exemplary RISC pipelined processor design 
incorporating various of the pipeline performance enhancements of the present invention. 

Fig. 15 is a functional block diagram of one exemplary embodiment of a 
computer system useful for synthesizing gate logic implementing the aforementioned 
pipeline performance enhancements within a digital processor device. 

Detailed Description 
Reference is now made to the drawings wherein like numerals refer to like parts 

throughout. 

As used herein, the term "processor" is meant to include any integrated circuit or 
other electronic device capable of performing an operation on at least one instruction 
word including, without limitation, reduced instruction set core (RISC) processors such 
as the ARC™ user-configurable core manufactured by the Assignee hereof, central 
processing units (CPUs), and digital signal processors (DSPs). The hardware of such 
devices may be integrated onto a single piece of silicon ("die"), or distributed among two 
or more die. Furthermore, various functional aspects of the processor may be 
implemented solely as software or firmware associated with the processor. 

Additionally, it will be recognized by those of ordinary skill in the art that the 
term "stage" as used herein refers to various successive stages within a pipelined 
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processor; i.e., stage 1 refers to the first pipelined stage, stage 2 to the second pipelined 
stage, and so forth. 

It is also noted that while the following description is cast in terms of VHSIC 
hardware description language (VHDL), other hardware description languages such as 
5 Verilog® may be used to describe various embodiments of the invention with equal 
success. Furthermore, while an exemplary Synopsys® synthesis engine such as the 
Design Compiler 2000.05 (DC00) is used to synthesize the various embodiments set forth 
herein, other synthesis engines such as Buildgates® available from, inter alia, Cadence 
Design Systems, Inc., may be used. IEEE std. 1076.3-1997, IEEE Standard VHDL 

10 Synthesis Packages, describe an industry-accepted language for specifying a Hardware 
Definition Language-based design and the synthesis capabilities that may be expected to 
be available to one of ordinary skill in the art. 

Lastly, it is noted that as used in this disclosure, the terms "breakpoint" and 
"breakpoint instruction" refer generally that class of processor instructions which result 

15 in an interrupt or halting of at least a portion of the execution or processing of 
instructions within the pipeline or associated logic units of a digital processor. As 
discussed in greater detail below, one such instruction comprises the "Brk x " class of 
instructions associated with the ARC™ extensible RISC processor previously referenced; 
however, it will be recognized that any number of different instructions meeting the 

20 aforementioned criteria may benefit from the methodology of the present invention. 

It will be noted that while the various methodologies of the invention are 
described herein in terms of a particular sequence of steps, such descriptions are only 
exemplary of the broader methods. Accordingly, the sequence of performace of such 
steps may in many cases be permuted, and/or additional steps added. Other steps may be 

25 optional. All such variations are considered to fall within the scope of the claims 
appended hereto. 

Overview 

Pipelined CPU instruction decode and execution is a common method of 
providing performance enhancements for CPU designs. Many CPU designs offer 
30 programmers the opportunity to use instructions that span multiple words. Some multi- 

-14- 



word instructions permit a greater number of operands and addressing modes, while 
others enable a wider range of immediate data values. For multi-word immediate data, 
pipelined execution of instructions has some built-in limitations. As previously discussed, 
one of these limitations is the potential for an instruction containing long immediate data 
5 to be impacted by a pipeline stall before the long immediate data has been completely 
fetched from memory. This stalling of an incompletely fetched piece of data has several 
ramifications, one of which is that the otherwise executable instruction may be stalled 
before it is necessary. This leads to increased execution time and overhead, thereby 
reducing processor performance. 

10 The present invention provides, inter alia, a way to avoid the stalling of long 

immediate data instructions so that performance is maximized. The invention further 
eliminates a critical path delay in a typical pipelined CPU by treating certain multi-word 
long immediate data instructions as a larger or "atomic" multi-word oversized instruction. 
These larger instructions are multi-word format instructions such as those employing long 

15 immediate data. Typical instruction types for the oversized instructions disclosed herein 
include "load immediate" and "jump" type instructions. 

Processor instruction execution time is critical for many applications; therefore, 
minimizing so-called "critical paths" within the decode phase of a multi-stage pipelined 
processor is also an important consideration. One approach to improving performance of the 

20 CPU in all cases is removing the speed path limitations. The present invention accomplishes 
removal of such path limitations by, inter alia, reducing the number of critical path delays in 
the control logic associated with instruction fetch and decode, including decode of 
breakpoint instructions used during processes such as debug. By moving the breakpoint 
instruction decode from stage 1 (as in the prior art) to stage 2, the present invention 

25 eliminates the speed path constraint imposed by the breakpoint instruction; stage 1 
instruction word decoding is advantageously removed from the critical path. 

Delays in the pipeline are further reduced using the methods of the present invention 
through modifications to the pipeline hazard detection and control logic (and register 
structure), which effectively reveal more parallelism in the pipeline. Pipelining of operations 

30 which span multiple cycles is also utilized to increase parallelism. 
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The present invention further advantageously permits the data cache to be integrated 
into the processor core in a manner that allows the pipeline to advance one stage ahead of 
the data cache. In the particular case of the aforementioned ARC™ extensible RISC 
processor manufactured by the Assignee hereof, since the valid signal for returning loads 
5 (i.e., "ldvalid") does not necessarily influence pipeline movement, it can be assumed that the 
data cache will "hit" (i.e., contain the appropriate data value when accessed). Such cache hit 
allows the pipeline to move on to conduct further processing. If this assumption is wrong, 
and the requested data word is needed by an execution unit in stage 3, the pipeline can then 
be stalled. This "latent stall" approach improves pipeline performance significantly, since 

10 stalls within the pipeline due to cache "misses" are invoked only on an as-needed basis. 

Appendix I provides detailed logic equations in HDL format detailing the method of 
the present invention in the context of the aforementioned ARC™ extensible RISC 
processor core. It will be recognized, however, that the logic equations of Appendix I (and 
those specifically described in greater detail below) are exemplary, and merely illustrative of 

15 the broader concepts of the invention. 

While each of the improvement elements referenced above may be used in isolation, 
it should be recognized that these improvements advantageously may be used in 
combination. In particular, the combination of an instruction memory cache with the bypass 
logic will serve to maximize instruction execution rates. Likewise, the use of a data cache 

20 minimizes data related processor stalls. Combining the breakpoint function with memory 
caches mitigates the impact of the breakpoint function. Selection of combinations of these 
functions compromises complexity with performance. It will be appreciated that the choice 
of functions may be determined by a number of factors including the end application for 
which the processor is designed. 

25 

"Atomic" Instructions 

The invention in one aspect prevents enabling the host to halt the core while an 
instruction with long immediate values in stage 2 has not merged. This results in making 
the instructions containing long immediate data non-stallable on the boundary between 
30 the instruction opcode and the immediate data. Consequently the instruction containing 
long immediate data is treated as if the CPU was wider in word width for that instruction 
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only. The foregoing functionality is specifically accomplished within the ARC™ core by 
connecting the hold_host value to the instruction merge logic, i.e. p2jnerge_valid_r and 
p21imm. Fig. 5 illustrates one exemplary embodiment of the logical flow of this 
arrangement. The method 500 generally comprises first determining whether an 
5 instruction with long immediate (limm) data is present (step 502); if so the core merge 
logic is examined to determine whether merging in stage 2 of the pipeline has occurred 
(step 504). If merging has occurred (step 506), the halt signal to the core is enabled (i.e., 
"halt permissive" per step 508), thereby allowing the core to be halted at any time upon 
initiation by the host. If merging has not occurred per step 506, then the core waits one 
10 instruction cycle (step 510) and then re-examines the merge logic to determine if merging 
has occurred. Accordingly, long immediate instructions cannot be stalled unless merging 
has occurred, which effectively precludes stalling on the instruction/immediate data word 
=15 boundary. 

-SJ Appendix I hereto provides detailed logic equations (rendered in hardware 

=■11 15 description language) of one exemplary embodiment of the functionality of Fig. 5, 
s |j specifically adapted for the aforementioned ARC core manufactured by the Assignee 

5a?s hereof. 

H Enhanced Parallelism 

"si™ 

)*l 20 As previously shown in Fig. 4, the speed of each pipeline stage in the non- 

Mfc optimized prior art pipeline structure is bound by the slowest stage. Some functional 

blocks within the instruction fetch pipeline stage of the processor are not optimally 
placed within the pipeline structure. 

Fig. 6 illustrates the impact on pipeline operation of the methods of enhanced 
25 parallelism according to the present invention. The dark shaded blocks 602, 604, 606, 
608, 610 show areas of modification. These modifications, when implemented, produce 
significant improvements to the maximum speed of the core. Specifically, full pipelining 
of the blocks as in the present embodiment allows them to overlap with other blocks, and 
hence their propagation delay is effectively hidden. It is noted that these modifications do 
30 not change the instruction set architecture (ISA) in any way, but do produce slight 
changes in the timing of 64-bit instructions, instructions in delay slots, and jump indirect 
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instructions which could need to bypass data words from slow execution units to generate 
nextpc. 

Fig. 7 is a block diagram of the modified pipeline architecture 700 according to 
one embodiment of the invention. In the modified architecture of Fig. 7, the slow cache 
path does not influence the control path (unlike that of the prior art approach of Figs. 4 
and 4a), thereby reducing processor pipeline delays. Specifically, the ivalid signal 702 
produced by the data word selection and cache "hit" evaluation logic 704 is latched into 
the first stage latch 706. Additionally, the long immediate instruction word (pliw) signal 
708 resulting from the logic 704 is latched into the first stage latch 706. 

Using the arrangement of Fig. 7, the dataword fetch (ifetch) signal 717, which 
indicates the need to fetch instruction opcode or data from memory at the location being 
clocked into the program counter (pc) at the end of the current cycle, is decoupled or 
made independent of the ivalid signal 702. This results in the instruction cache 709 
ignoring the ifetch signal 717 (except when a cache invalidate is requested, or on start- 
up). 

Additionally, due to the latching arrangement of Fig. 7, the next program counter 
signal (nextpc) 716, which is indicative of the dataword address, is made independent of 
the word supplied by the memory controller (pliw) 708 and ivalid 702. Using this 
approach, nextpc is only valid when ifetch 717 is true (i.e., required opcode or dataword 
needs to be fetched by the memory controller) and ivalid is true (apart from start-up, or 
after an ivalidate). Note that the critical path signal or unnecessarily slow signal is readily 
revealed when the "nextpc" path 416 is removed (dotted flow lines of Fig. 4a). 

The hazard detection logic 722 and pipeline control logic 724 is further made 
independent of the ivalid signal 702; i.e., the stage 1, stage 2, and stage 3 enables (enl 
727, en2 729, and en3 730, respectively) are decoupled from the ivalid signal 702. 
Therefore, influence on pipeline movement by ivalid 702 is advantageously avoided. 

Instructions with long immediate data are merged in stage 2. This merge at stage 
2 is a consequence of the foregoing independence of the hazard logic 722 and control 
logic 724 from ivalid 702; since these instructions with long immediate data are made up 
of multiple multi-bit words (e.g., two 32-bit data words), two accesses of the instruction 
cache 709 are needed. That is, an instruction with a long immediate should not move to 
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stage 3 until both the instruction and long immediate data are available in stage 2 of the 
pipeline. This requirement is also imposed for jump instructions with long immediate 
data values. In current practice, the instruction opcode comes from stage 2 and the long 
immediate data from stage 1 when a long immediate instruction is issued, that is 5 when 
5 the instruction moves to stage 3. 

The present invention further utilizes "structural stalls" to enhance pipeline 
performance such as when a slow functional unit (or operand fetch in the case of the xy 
memory extension) generates nextpc 716 (that is, jump register indirect instructions, j 
[rx], where the value of rx can be bypassed from a functional unit). As used herein, the 
10 term "structural stalls" refers to stall requirements that are defined by limitations inherent 
in the functional unit. One example of a structural stall is the operand fetch associated 
with the XY memory extension of the ARC processor. This approach advantageously 
*i allows slow forwarding paths to be removed, by prematurely stalling the impeding 

13 operation. For example, new program counter (pc) values are rarely generated by 

m 15 multipliers; if such values are generated by the multiplier, they can result in a cycle delay 
y that is a 1 cycle stall or bubble, and allow next_pc to be obtained from the register file 

zips 

yl 731. In general, the present invention exploits the stall that is inherent in generating a 

next PC address which is not sequentially linear in the address space. This occurs when a 
new PC value is calculated by an instruction such as jump. In addition, it may be 
gl 20 appreciated that certain instruction sets permit arithmetic and logic operations to directly 
J J a new PC. Such computations also introduce a structural stall which under some 

circumstances may be exploited to continue operation of the CPU. 

In addition to the foregoing, the present invention further removes or optimizes 
remaining critical paths within the processor using selective pipelining of operations. 
25 Specifically, paths that can be extended over more than one processor cycle with no 
processor performance loss can be selectively pipelined if desired. As an example, the 
process of (i) activating sleep mode, (ii) stopping the core, and (iii) detecting a breakpoint 
instruction, does not need to be performed in a single cycle, and accordingly is a 
candidate for such pipelining. 

30 
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Breakpoint Instruction Decode Architecture 

Referring now to Fig. 8, one embodiment of the modified breakpoint architecture 
of the invention is described. As illustrated in Fig. 8, the architecture 800 comprises 
generally a first stage latch (register) 801, an instruction cache 802, instruction request 
5 selection logic 804, an intermediate (e.g., second stage) latch 806, and instruction decode 
logic 808. The instruction cache 802 stores or caches instructions received from the latch 
801 which are to be decoded by the instruction decode logic 808, thereby obviating at 
least some program memory accesses. The design and operation of instruction (program) 
caches is well known in the art, and accordingly will not be described further here. The 
10 instruction word(s) stored within the instruction cache 802 is/are provided to the 
instruction request address selection logic 804, which utilizes the program counter 
(nextpc) register to identify the next instruction to be fetched, based on data 810 (e.g., 16- 
y bit word) from the instruction decode logic 808 and the current instruction word. This 

ft] data includes such information as condition codes and other instruction state information, 

j|l 15 assembled into a logical collection of information which is not necessarily physically 
y assembled. For example, a condition code by itself may select an alternative instruction 

ffi ot be fetched. The address from which the instruction is to be fetched may be identified 

f % by a ariety of words such as the contents of a register or a data word from memory. The 

W instruction word provided to the instruction request logic 804 is then passed to the 

l|l 20 intermediate latch 806, and read out of that latch on the next successive clock cycle by 
ft the instruction decode logic 808. 

Hence, in the case of a breakpoint instruction, the decode of the instruction (and 
its subsequent execution) in the present embodiment is delayed until stage 2 of the 
pipeline. This is in contrast to the prior art decode arrangement (Fig. 1), wherein the 
25 instruction decode logic 808 is disposed immediately following the instruction cache 802, 
thereby providing for immediate decode of a breakpoint instruction after it is moved out 
of the instruction cache 802 (i.e., in the first stage), which places the decode operation in 
the critical path. 

Additionally, in order to move the breakpoint instruction decode to stage 2 as 
30 described above, the program counter (pc) of the present embodiment is changed from the 
current value to the breakpoint pc value through a simple assignment. This modification is 
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required based on timing considerations; specifically, by the time the breakpoint instruction 
is decoded, the pc has already been updated to point to the next instruction. Hence, the pc 
value must be "reset" back to the breakpoint instruction value to account for this decoding 
delay. 

5 The following examples illustrate the operation of the modified breakpoint 

instruction decode architecture of the present invention in detail. 

Example 1- Delay Slot 

Fig. 8a and the discussion following hereafter illustrate how a breakpoint 
instruction located with in a delay slot is processed using the present invention. As is well 
10 known in the digital processing arts, delay slots are used in conjunction with certain 
instruction types for including an instruction which is executed during execution of the 
& parent instruction. For example, a "jump delay slot" is often used to refer to the slot 

k ¥ within a pipeline subsequent to a branching or jump instruction being decoded. The 

p instruction after the branch (or load) is executed while awaiting completion of the 

pj 15 branch/load instruction. It will be recognized that while the example of Fig. 8a is cast in 
terms of a breakpoint instruction disposed in the delay slot after a "Jump To" instruction, 
other applications of delay slots may be used, whether alone or in conjunction with other 
I s * instruction types, consistent with the present invention. 

Note that as used herein, the nomenclature " <name><Address>" refers to the instruction 
Q 20 name at a given address. For example, "J.d A " refers to a "Jump To" instruction at address 
A. 

In step 820 of Fig. 8a, an instruction (e.g, "Jump To" at address A, or "J.d A ") is 
requested. Next, the breakpoint instruction at address B (Brks) is requested in step 822. 
In step 824, the target address at address C (Targetc) is requested. The target address is 
25 saved in the second operand register or the long immediate register of the processor in the 
illustrated example. The instruction in the fetch stage is killed. 

Next, in step 826, the breakpoint instruction of step 822 above (Brks) is decoded. The 
current pc value is updated with the value of lastpc, the address of Brke rather than the 
address of Targetc, as previously described. An extra state is also implemented in the 
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present embodiment to indicate (i) that a 'breakpoint restart' is needed, and (ii) if the 
breakpoint instruction was disposed in a delay slot (which in the present example it is). 

In step 828, the "Jump To" instruction J.dA completes, and once all other multi-cycle 
instructions have completed, the core is halted, reporting a break instruction. Next, in step 
830, the host takes control and changes Brk B to Adds (for example, by a "write" to main 
memory). The host then invalidates the memory mapping of address B by either 
invalidating the entire cache or invalidating the associated cache line. The host then starts 
the core running. 

After the core is running, the add instruction at address B, Adds, is fetched using the 
current program counter value (currentpc) in step 832. Then, in step 834, the target value 
at address C (Targetc) is requested, using the target address from stage 3 of the pipeline. 
The current program counter value (currentpc) is set equal to the Targetc address. In step 
836, Target2 c is requested. Lastly, in step 838, the Target3 c is requested. 

Note that in the example of Fig. 8a above, the breakpoint instruction execution is 
complicated by the presence of a delay slot. This requires the processor to restart 
operation at the delay slot after the completion of the breakpoint instruction. The 
instruction at the delay slot address is then executed, followed by the instruction at the 
address specified by the jump instruction. The program continues from the target address. 

Example 2 - Non-delay Slot Breakpoint Use 

Fig. 8b and subsequent discussion illustrate how a breakpoint instruction is 
normally handled within the pipeline when a delay slot is not present. 

First, in step 840, an add at address A (AddA) is requested. A breakpoint 
instruction at address B (Brk B ) is then requested in step 842. A "move" at address C 
(Move) is next requested in step 844. The instruction in the fetch stage (stage 1) is killed. 
The breakpoint instruction (Brke) is next decoded in step 846. The current pc value is 
updated with the value of lastpc, i.e., the address of BrkB rather than the address of the 
instruction following Move. Move is killed. 
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Next, in step 848, the AdcU instruction completes, and once all other multi cycle 
instructions (including delayed loads) have completed, the processor is halted, reporting a 
break instruction. The host then takes control in step 850, changing Brks to Adde (such 
as by a write to main memory). The host then invalidates the memory mapping of 
5 address B by either invalidating the entire cache or invalidating the associated cache line. 
The host then starts the core running again per step 850. 

In step 852, the add instruction at address B (Adds) is fetched using the current 
address in the program counter (currentpc). A move at address C (Move) is again 
requested in step 854. Mov2c is then requested in step 856, and lastly Mov3c is requested 
10 in step 858. 

Example 3 - Stalled Jump and Branches 

Referring now to Fig. 8c, in step 860, the jump instruction J.dA is requested. The 
breakpoint instruction (Brks) is next requested in step 862. Targetc is next requested in 
step 864. The target address is saved in the second operand register or the long immediate 
1 5 register in the illustrated embodiment, although it will be recognized that other storage 
locations may be utilized. 

The breakpoint instruction (Brk B ) is next decoded in step 866. Current pc is updated 
with the value of lastpc, the address of Brke rather than the address of Targetc. As with 
the example of Fig. 8b above, an extra state is added to indicate (i) that a 'breakpoint 

20 restart' is needed, and (ii) if the breakpoint instruction was in a delay slot. The "Jump 
To" instruction J.dA is stalled in stage 3 since, inter alia, it may be a link jump. Once all 
other multi cycle instructions have completed the core is halted, and a break instruction 
reported. In step 868, the host takes control and changes Brk B to Targetc- The host then 
invalidates the memory mapping of address B by either invalidating the entire cache or 

25 invalidating the associated cache line. The host then starts the core running in step 870. 

The add instruction at address B (Adds) is next fetched using the address of the 
currentpc. In step 874, Targetc is requested, using the target address from stage 3 
(execute) of the pipeline. The currentpc address is set equal to the Targetc address. 
Target2c is then requested per step 876, and Target3c is requested per step 878. 
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Note that in the example of Fig. 8c, the breakpoint instruction is disposed in a delay 
slot, but the processor pipeline is stalled. The breakpoint instruction is held for execution 
until the multi-cycle instructions have completed executing. This limitation is imposed to 
prevent leaving the core in a state of partial completion of a multi-cycle instruction 
5 during the breakpoint instruction execution. 

Bypass Logic 

Referring now to Fig. 9, the bypass logic 900 of the present invention comprises 
bypass operand selection logic 902, one or more single cycle functional units 904, one or 

10 more multi-cycle functional units 906, result selection logic 908 operatively coupled to 
the output of the single cycle functional units, a register 910 coupled to the output of the 
result selection logic 908 and the multi-cycle functional units 906, and more multi-cycle 
functional units 912 and result selection logic 914 coupled sequentially to the output of 
the register 910 as part of the second execute stage 920. A second register 918 is also 

15 coupled to the output of the result selection logic 914. A return path 922 connects the 
output of the second stage result selection logic 914 to the input of a third "multi- 
function" register 924, the latter providing input to aforementioned bypass operand 
selection logic 902. A similar return path 926 is provided from the output of the first 
stage result selection logic 908 to the input of the third register 924. As used herein, the 

20 term "single-cycle" refers to instructions which have only one execute stage, while the 
term "multi-cycle" refers to instructions having two or more execute stages. Of particular 
interest are the instructions that are multi-cycle by virtue of a need to load long 
immediate data. These instructions are formed, e.g., by two sequential instruction words 
in the instruction memory. The first of the words generally includes the op-code for the 

25 instruction, and potentially part of the long immediate data. The second word is made up 
of all (or the remainder) of the long immediate data. 

By employing the bypass arrangement of Fig. 9, the present invention replaces the 
register or memory location used in prior art systems such as that illustrated in Fig. 2 with 
a special register 924 that serves multiple purposes. When used in a "bypass" mode, the 

30 special register 924 retains the summation result and presents the summation result as a 
value to be selected from by an instruction. The result is a software loop that can execute 
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nearly as fast as custom-built hardware. The execution pipeline fills with the instructions 
to perform the sum of products operation and the bypass logic permits the functional 
units to operate at peak speed without any additional addressing of memory. Other 
functions of this register 924 (in addition to the aforementioned "bypass" mode 
5 operation) include (i) latching the source operands to permit fully static operation, and (ii) 
providing a centralized location for synchronization signal/data movement. 

As can be seen from Fig. 9, the duration for single cycle instructions in the 
present embodiment of the pipeline is unchanged as compared to that for the prior art 
arrangement (Fig. 2); however, multi-cycle instructions benefit from the pipeline 
10 arrangement of the present invention by effectively removing the bypass logic during the 
last cycle of the multi-cycle execution. Note that in the case of single cycle instructions, 
the bypass logic is not on the critical path because the datapath is sequenced to permit 
y delay-free operation. By moving the latches (register) 924 to the front of the datapath as 

{ft in Fig. 9, the second and subsequent cycles required for instruction execution are 

Jjj 15 provided with additional time. This additional time comes from the fact that there are no 
Q additional decoding delays associated with the logic for the functional units and operand 

Q\ selection, and because the register 924 may be clocked by a later pipeline stage. Since a 

!U later stage clock signal may be used to clock the register, the register latching is 

W accomplished prior to the clock signal associated with the operand decode logic. Hence, 

S 20 the operand decode logic is not "left waiting" for the latching of the register 924. 

In one exemplary design of the ARC™ core incorporating the bypass logic 
functionality of the invention as described above with respect to Fig. 9, the decode logic 
900 and functional units 904, 906 are constrained to be minimized simultaneously. This 
constraint during design synthesis advantageously produces one fewer level of gate delay 
25 in the datapath as compared to the design resulting if such constraint is not imposed, 
thereby further enhancing pipeline performance. It will be appreciated that this 
refinement is not neceaasry to practice the essence of the invention, but serves to further 
the perfromance enhancement of the invention. 

The results of the previous operation (specifically, in the forgoing sum-of- 
30 products example, the sum from a given iteration) are provided to the multi-function 
register 924 which in turn provides the sum value directly to the input of the bypass 
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operand selection logic 902. In this fashion, the bypass operand selection logic 902 is not 
required to access a memory location or another register repeatedly to provide the 
operands for the summation operation. 

It is also noted that the present invention may advantageously be implemented 
5 "incrementally" by moving lesser amounts of the bypass logic to the execution stage 
(e.g., stage 3). For example, rather than moving all bypass logic to stage 3 as described 
above, only the logic associated with bypassing of late arriving results of functional units 
can be moved to stage 3. It will be appreciated that differing amounts of logic 
optimization will be obtained based on the amount of bypass logic moved to stage 3. 

10 In addition to the structural improvement in performance as previously described 

(i.e., obviating memory/register accesses during each iteration of multi-cycle instructions, 
thereby substantially reducing the total number of memory/register accesses performed 
during any given iterative calculation), there are several additional benefits provided by 
employing the bypass logic arrangement of the present invention. One such benefit is that 

15 by removing the interposed register between the bypass operand selection and the 
functional units (shown in Fig. 2), design compilers can better optimize the generated 
logic to maximize speed and/or minimize the number of gates in the design. Specifically, 
the design compiler does not have to consider and account for the presence of the register 
interposed between the bypass operand selection logic and the single/multi-cycle 

20 functional units. 

Another benefit is that by grouping the registers and logic in the improved fashion 
of Fig. 9, the bypass function is better isolated from the rest of the design. This makes 
VHDL simulations potentially execute faster and simplifies fault analysis and coverage. 
In sum, two primary benefits are derived from the improved bypass logic design 

25 described above. The first benefit is the ability to manage late arriving results from the 
functional units more efficiently. The second benefit is that there is better logic 
optimization within the device. 

The first benefit may be obtained by only moving the minimum required portion 
of the logic to the improved location. The second benefit may be attained in varying 

30 degrees by the amount of logic that is moved to the new location. This second benefit 
derives at least in part from the synthesis engine's improved ability to optimize the 
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results. The ability to optimize the results stems from the way in which the exemplary 
synthesis engine functions. In specific, synthesis engines generally treat all logic between 
registers as a single block to be optimized. Blocks that are divided by registers are 
optimized only to the registers. By moving the operand selection logic so that no registers 
are interposed between it and the functional unit logic, the synthesis engine can perform a 
greater degree of optimization. 

More detail on the design synthesis process incorporating the bypass logic of the 
present invention is provided herein with respect to Fig. 13. 

Referring now to Fig. 10, a method for operating the pipeline of a pipelined 
processor which facilitates the bypass of various components and registers so as to 
maximize processor performance during iterative operations (e.g., sum of products) is 
disclosed. The first step 1002 of the method 1000 comprises providing a multi-function 
register 914 such as that described with respect to Fig. 9 above. This register is defined 
in step 1004 to include a "bypass mode", wherein during such bypass mode the register 
maintains the result of a multi-cycle scalar operation therein. In this fashion, the bypass 
operand selection logic 902 is not required to access memory or another location to 
obtain the operand (e.g., Sum value) used in the iterative calculation as in prior art 
architectures. Rather, the operand is stored by the register 914 for at least a part of one 
cycle, and provided directly to the bypass operand selection logic using decode 
information from the instruction to select register 914 directly without the need for any 
address generation. This type of register access differs from the general purpose register 
access present in RISC CPUs in that no address generation is required. General purpose 
register access requires register specification and/or address generation which consumes a 
portion of an instruction cycle and requires the use of the address generation resource of 
the CPU. The register employed in the bypass logic is an "implied" register that is 
specified by the instruction being executed without the need for a separate register 
specification. For certain instructions the registers of the datapath may function the same 
as an accumulator or other register. The value stored in the datapath register is transferred 
to a general purpose register during a later phase of the pipeline operation. In the 
meantime, iteration or other operations continue to be processed at full speed. 
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Next, in step 1006, a multi-cycle scalar operation is performed by the processor a 
first time. In the foregoing example of the sum-of-products calculation, such an operation 
comprises one iteration of the "Multiply" and "Sum" sub-operations, the result of the 
Sum sub-operation being provided back to the multi-function register 914 per step 1008 
5 for direct use in the next iteration of the calculation. 

In step 1010, the result of the previous iteration is provided directly from the 
register 914 to the bypass operand selection logic 902 via a bus element. 

Lastly, a second iteration of the operation is performed using the result of the first 
operation from the register 914, and another operand supplied by the address generation 
10 logic of the RISC CPU. The iterations are continued until the multi-cycle operation is 
completed (step 1011), and the program flow stopped or other wise continued (step 
1012). 

Data Cache Integration 

15 Integration of the data cache can have a profound effect on the speed of the 

processor. In general, the modified control of the data cache according to the present 
invention is accomplished through data hazard control logic modifications. The following 
discussion describes several enhancements to the prior art data cache integration scheme 
of Fig. 3 made by the present invention, including (i) assumption of data cache "hit" 

20 unless a "miss" actually occurs; (ii) improved instruction request address generation; and 
(iii) relocation of bypass logic from stage 2 (decode) to stage 3 (execute). It should also 
be noted that some of these modifications provide other benefits in the operation of the 
core in addition to improved pipeline performance, such as lower operating power, 
reduced memory accesses, and improved memory performance. 

25 Referring now to Figs. 1 1 and 1 la, the improved data cache structure and method 

of the present invention is described in further detail. 

One embodiment of the improved data cache architecture is shown in Fig. 11, in 
the context of the multi-stage pipeline of the aforementioned ARC™ RISC processor. 
The architecture 1100 comprises a data cache 1102, bypass operand selection logic 1104 

30 (decode stage), result selection logic 1106 (2 logic levels), latch control logic 1108 (2 
levels), program counter (nextpc) address selection logic 1110 (2 levels), and cache 
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address selection logic 1112 (2 levels), each of the logic units 1106, 1 108, 1112 
operatively supplying a third stage latch (register) 1116 disposed at the end of the second 
execution stage (E2) 1118. Summation logic 1111 is also provided which sums the 
outputs of the bypass operand selection logic 1104 prior to input to the multiplexers 

5 1120, 1122 in the data cache 1102. 

In addition to the multiplexers 1120, 1122, the data cache 1102 comprises a 
plurality of data random access memory (RAM) devices 1126 (0 through w-1), further 
having two sets of associated tag RAMs 1 127 (0 through w-1) as shown. As used herein, 
the variable "w" represents the number of ways that a set associative cache may be 

10 searched. In general, w corresponds to the width of the memory array in multples of a 
word. For example, the memory may be two words wide (w=2) and the memory is then 
divided into two banks for access. The output of the data RAMs 1126 is multiplexed 
using a (w-1) channel multiplexer 1131 to the input of the byte/word/long word 
extraction logic 1132, the output of which is the load value 1134 provided to the result 

1 5 selection logic 1 106. The output of each of the tag RAMs 1 127 is logically ORed with 
the output of the summation logic 1 1 1 1 in each of the 0 through w-1 memory units 1138. 
The outputs of the memory units 1138 are input in parallel to a logical "OR" function 
1 1 39 which determines the value of the load valid (ldvalid) signal 1 140, the latter being 
input to the latch control logic 1 108 prior to the third stage latch 1116. 

20 In comparison to the prior art arrangement of Fig. 3 previously described, the 

present embodiment has relocated the bypass operand selection logic from the decode 
stage (and E2 stage) of the pipeline to the first execute stage (El) as shown in Fig. 11. 
Additionally, the nextpc address selection logic 1110 receives the load value immediately 
after the data cache multiplexer 1131, as opposed to receiving the load value after the 

25 results selection logic as in Fig. 3. The valid signal for returning loads (ldvalid) 1140 is 
also routed directly to the two-level latch control logic 1108, versus to the pipeline 
control and hazard detection logic as in Fig. 3. 

The foregoing modifications provide the following functionality: 

(i) Assumption of data cache "hit" - In contrast to the prior art approach of Figs. 3 
30 and 3a, the ldvalid signal 1140 does not influence pipeline movement in the present 
invention, since it is decoupled from the control logic and hazard detection logic. Rather, 
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it is assumed that the data cache will "hit", and therefore the pipeline will continue to 
move. If this assumption is wrong, and the requested dataword is needed by an execution 
unit in the execution stage (El or E2), the pipeline is stalled at that point When the data 
cache 1102 makes the dataword available to the execution unit in need thereof, the 
5 operand for the instruction in the decode stage is updated. 

(ii) Instruction Request Address Generation - Word or byte extracted load results 
do not usually generate the instruction request address for a jump register indirect 
instruction (e.g., j [rx]). Therefore, as part of the present invention, the instruction request 
address is generated earlier by the next address selection logic of figure 11, and a jump 

10 register indirect address where the register value is bypassed from a load byte or word 
causes a structural pipeline stall. 

(iii) Relocation of Bypass Logic - As illustrated in Fig. 1 1, the present invention 
also relocates the bypass operand selection logic from stage 2 (decode) to stage 3 
(execute El), and from execute E2 to El, to allow the multi-cycle/multi-stage functional 

1 5 units cache extra time on all cycles but the first. 

Fig. 11a graphically illustrates the movement of the pipeline of an exeplary 
processor configured with the data cache integration improvements of the present invention. 
Note that the un-dashed bypass arrow 1 170 indicates prior art bypass logic operation, while 
the dashed bypass arrow 1172 indicates bypass logic if it is moved from stage 2 to 3 

20 according to the present invention. The following provides and explanation of the operation 
of the data cache of Fig. 11a. 

In step 1 174, a load (Ld) is requested. Next, a Mov is requested per step 1 176. An 
Add is then requested per step 1178. In step 1180, the Ld begins to execute. In step 1182, 
the Mov begins to execute, and the cache misses. The Mov operation moves through the 

25 pipeline per step 1 184. The Add operation stalls in execute stage El, since the cache missed 
and the Add is dependent on the cache result. The cache then returns the Load Result Value 
per step 1 1 86, and the Add is computed per step 1188. The Add moves through the pipeline 
per step 1 1 90, the Add result is written back per step 1 1 92. 
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As illustrated in Fig. 11a, the improved method of data cache integration of the 
present invention reduces the number of stalls encountered, as well as the impact of a cache 
"miss" (i.e., condition where the instruction is not cached in time) during the execution of 
the program. The present invention results in the add instruction continuing to move through 
the pipeline until reference 'f saving instruction cycles. Further, by delaying pipeline stalls, 
the overall performance of the processor is increased. 

Method of Enhancing Performance of Processor Design 

Referring now to Fig. 12, a method of enhancing the performance of a digital 
processor design such as the extensible ARC™ of the Assignee hereof is described. As 
illustrated in Fig. 12, the method generally comprises first providing a processor design 
which is non-optimized (step 1202), including inter alia critical path signals which 
unnecessarily delay the operation of the pipeline of the design. For example, the non- 
optimized prior art pipeline(s) of Figs. 1 through 4a comprises such designs, although 
others may clearly be substituted. In the present embodiment of the method, the processor 
design further includes an instruction set having at least one breakpoint instruction, for 
reasons discussed in greater detail below. 

Next, in step 1204, a program comprising a sequence of at least a portion of the 
processor's instruction set (including for example the aforementioned breakpoint 
instruction) is generated. The breakpoint instruction may be coded within a delay slot as 
previously described with respect to Fig. 8a herein, or otherwise. 

Next, in step 1206, a critical path signal within the processing of program within 
the pipeline is identified. In the illustrated embodiment, the critical path is associated 
with the decode and processing of the breakpoint instruction. The critical path is 
identified through use of a simulation running a simulation program such as the 
"Viewsim™" program manufactured by Viewlogic Corporation, or other similar 
software. Fig. 4a illustrates the presence of a critical path signal in the dataword address 
(e.g., nextpc) generation logic of a typical processor pipeline. 

Next, in step 1208 the architecture of the pipeline logic is modified to remove or 
mitigate the delay effects of the non-optimized pipeline logic architecture. In the 
illustrated embodiment, this modification comprises (i) relocating the instruction decode 
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logic to the second (decode) stage of the pipeline as previously described with reference 
to Fig. 8, and (ii) including logic which resets the program counter (pc) to the breakpoint 
address, as previously described. 

The simulation is next re-run (step 1210) with the modified pipeline configuration 
5 to verify the operability of the modified pipeline, and also determine the impact (if any) 
on pipeline operation speed. The design is then re-synthesized (step 1212) based on the 
foregoing pipeline modifications. The foregoing steps (i.e., steps 1206, 1208, 1210, and 
1212, or subsets thereof) are optionally re-performed by the designer (step 1214) to 
further refine and improve the speed of the pipeline, or to optimize for other core 
10 parameters. 

Method of Synthesizing 

5 Referring now to Fig. 13, the method 1300 of synthesizing logic incorporating the 

<S long instruction word functionality previously discussed is described. The generalized 

Ifl 15 method of synthesizing integrated circuit logic having a user-customized (i.e., "soft") 
9 instruction set is disclosed in Applicant's co-pending U.S. Patent Application Serial No. 

ill 09/418,663 entitled "Method And Apparatus For Managing The Configuration And 

Functionality Of A Semiconductor Design" filed October 1 4, 1 999, which is incorporated 
herein by reference in its entirety. 
j| 20 While the following description is presented in terms of an algorithm or computer 

It program running on a microcomputer or other similar processing device, it can be 

appreciated that other hardware environments (including minicomputers, workstations, 
networked computers, "supercomputers", and mainframes) may be used to practice the 
method. Additionally, one or more portions of the computer program may be embodied 
25 in hardware or firmware as opposed to software if desired, such alternate embodiments 
being well within the skill of the computer artisan. 

Initially, user input is obtained regarding the design configuration in the first step 
1302. Specifically, desired modules or functions for the design are selected by the user, and 
instructions relating to the design are added, subtracted, or generated as necessary. For 
30 example, in signal processing applications, it is often advantageous for CPUs to include a 
single "multiply and accumulate" (MAC) instruction. In the present invention, the 
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instruction set of the synthesized design is further modified so as to incorporate the desired 
aspects of pipeline performance enhancement (e.g. "atomic" instruction word) therein. 

The technology library location for each VHDL file is also defined by the user in 
step 1302. The technology library files in the present invention store all of the information 

5 related to cells necessary for the synthesis process, including for example logical function, 
input/output timing, and any associated constraints. In the present invention, each user can 
define his/her own library name and location(s), thereby adding further flexibility. 

Next, in step 1303, the user creates customized HDL functional blocks based on the 
user's input and the existing library of functions specified in step 1302. 

10 In step 1304, the design hierarchy is determined based on user input and the 

aforementioned library files. A hierarchy file, new library file, and makefile are 
subsequently generated based on the design hierarchy. The term "makefile" as used herein 
refers to the commonly used UNIX makefile function or similar function of a computer 
system well known to those of skill in the computer programming arts. The makefile 

1 5 function causes other programs or algorithms resident in the computer system to be executed 
in the specified order. In addition, it further specifies the names or locations of data files and 
other information necessary to the successful operation of the specified programs. It is 
noted, however, that the invention disclosed herein may utilize file structures other than the 
"makefile" type to produce the desired functionality. 

20 In one embodiment of the makefile generation process of the present invention, 

the user is interactively asked via display prompts to input information relating to the 
desired design such as the type of "build" (e.g., overall device or system configuration), 
width of the external memory system data bus, different types of extensions, cache 
type/size, etc. Many other configurations and sources of input information may be used, 

25 however, consistent with the invention. 

In step 1306, the user runs the makefile generated in step 1304 to create the 
structural HDL. This structural HDL ties the discrete functional block in the design 
together so as to make a complete design. 

Next, in step 1308, the script generated in step 1306 is run to create a makefile for 

30 the simulator. The user also runs the script to generate a synthesis script in step 1308. 
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At this point in the program, a decision is made whether to synthesize or simulate 
the design (step 1310). If simulation is chosen, the user runs the simulation using the 
generated design and simulation makefile (and user program) in step 1312. Alternatively, 
if synthesis is chosen, the user runs the synthesis using the synthesis script(s) and 

5 generated design in step 1314. After completion of the synthesis/simulation scripts, the 
adequacy of the design is evaluated in step 1316. For example, a synthesis engine may 
create a specific physical layout of the design that meets the performance criteria of the 
overall design process yet does not meet the die size requirements. In this case, the 
designer will make changes to the control files, libraries, or other elements that can affect 

10 the die size. The resulting set of design information is then used to re-run the synthesis 
script. 

If the generated design is acceptable, the design process is completed. If the 
design is not acceptable, the process steps beginning with step 1302 are re-performed 
until an acceptable design is achieved. In this fashion, the method 1300 is iterative. 

15 Fig. 14 illustrates an exemplary pipelined processor fabricated using a 1.0 urn 

process. As shown in Fig. 14, the processor 1400 is an ARC™ microprocessor-like CPU 
device having, inter alia, a processor core 1402, on-chip memory 1404, and an external 
interface 1406. The device is fabricated using the customized VHDL design obtained 
using the method 1300 of the present invention, which is subsequently synthesized into a 

20 logic level representation, and then reduced to a physical device using compilation, 
layout and fabrication techniques well known in the semiconductor arts. For example, the 
present invention is compatible with 0.35, 0.18, and 0.1 micron processes, and ultimately 
may be applied to processes of even smaller or other resolution. An exemplary process 
for fabrication of the device is the 0.1 micron "Blue Logic" Cu-11 process offered by 

25 International Business Machines Corporation, although others may be used. 

It will be appreciated by one skilled in the art that the processor of Figure 14 may 
contain any commonly available peripheral such as serial communications devices, 
parallel ports, timers, counters, high current drivers, analog to digital (A/D) converters, 
digital to analog converters (D/A), interrupt processors, LCD drivers, memories and other 

30 similar devices. Further, the processor may also include custom or application specific 
circuitry, including an RF transceiver and modulator (e.g., Bluetooth™ compliant 2.4 
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GHz transceiver/modulator), such as to form a system on a chip (SoC) device useful for 
providing a number of different functionalities in a single package. The present invention 
is not limited to the type, number or complexity of peripherals and other circuitry that 
may be combined using the method and apparatus. Rather, any limitations are imposed 
5 by the physical capacity of the extant semiconductor processes which improve over time. 
Therefore it is anticipated that the complexity and degree of integration possible 
employing the present invention will further increase as semiconductor processes 
improve. 

It is also noted that many IC designs currently use a microprocessor core and a 

10 DSP core. The DSP however, might only be required for a limited number of DSP 
functions, or for the IC's fast DMA architecture. The invention disclosed herein can 
support many DSP instruction functions, and its fast local RAM system gives immediate 
access to data. Appreciable cost savings may be realized by using the methods disclosed 
herein for both the CPU & DSP functions of the IC. 

15 Additionally, it will be noted that the methodology (and associated computer 

program) as previously described herein can readily be adapted to newer manufacturing 
technologies, such as 0.18 or 0.1 micron processes (e.g. "Blue Logic™" Cu-11 process 
offered by IBM Corporation), with a comparatively simple re-synthesis instead of the 
lengthy and expensive process typically required to adapt such technologies using "hard" 

20 macro prior art systems. 

Referring now to Fig. 15, one embodiment of a computing device capable of 
synthesizing logic structures capable of implementing the pipeline performance 
enhancement methods discussed previously herein is described. The computing device 
1500 comprises a motherboard 1501 having a central processing unit (CPU) 1502, 

25 random access memory (RAM) 1504, and memory controller 1505. A storage device 
1506 (such as a hard disk drive or CD-ROM), input device 1507 (such as a keyboard or 
mouse), and display device 1508 (such as a CRT, plasma, or TFT display), as well as 
buses necessary to support the operation of the host and peripheral components, are also 
provided. The aforementioned VHDL descriptions and synthesis engine are stored in the 

30 form of an object code representation of a computer program in the RAM 1504 and/or 
storage device 1506 for use by the CPU 1502 during design synthesis, the latter being 
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well known in the computing arts. The user (not shown) synthesizes logic designs by 
inputting design configuration specifications into the synthesis program via the program 
displays and the input device 1507 during system operation. Synthesized designs 
generated by the program are stored in the storage device 1506 for later retrieval, 
displayed on the graphic display device 1508, or output to an external device such as a 
printer, data storage unit, fabrication system, other peripheral component via a serial or 
parallel port 1512 if desired. 

It will be recognized that while certain aspects of the invention are described in 
terms of a specific sequence of steps of a method, these descriptions are only illustrative 
of the broader methods of the invention, and may be modified as required by the 
particular application. Certain steps may be rendered unnecessary or optional under 
certain circumstances. Additionally, certain steps or functionality may be added to the 
disclosed embodiments, or the order of performance of two or more steps permuted. All 
such variations are considered to be encompassed within the invention disclosed and 
claimed herein. 

While the above detailed description has shown, described, and pointed out novel 
features of the invention as applied to various embodiments, it will be understood that 
various omissions, substitutions, and changes in the form and details of the device or 
process illustrated may be made by those skilled in the art without departing from the 
invention. The foregoing description is of the best mode presently contemplated of 
carrying out the invention. This description is in no way meant to be limiting, but rather 
should be taken as illustrative of the general principles of the invention. The scope of the 
invention should be determined with reference to the claims. 



-36- 



APPENDIX I -HDL DESCRIPTION 
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This file has been redacted 



Confidential Information 
Limited Distribution to Authorized Persons Only 
Created 1996 and Protected as an Unpublished Work 
Under the U.S. Copyright Act of 1976. 
Copyright © 1996 - 2001 ARC CORES LTD. 
All Rights Reserved. 



— Inputs and Outputs: 



L indicates a latched signal, U indicates an signal produced by logic. 
Stage 1 - Opcode fetch 



in pliw[31:0] U The instruction word supplied by the memory controller. 

It is considered to be valid when the ivalid signal is 



in ivalid 



in ivic 



three 
generates 



out pcen 



out i fetch 



true . 

U Qualifying signal for pliw[31:0]. When it is low, this 
indicates that the m/c has not been able to fetch the 
requested opcode, and that the program counter should not 
be incremented. The pipeline might be stalled, depending 
upon whether the instruction in stage 2 needs to look at 
the instruction in stage 1. 

When it is true, the instruction is clocked into 
pipeline stage 2 provided that the pipeline is able to 
move on . 

U Indicates that all values in the cache are to be 
invalidated, (it stands for Invalidate Instruction Cache) . 
It is anticipated that this signal will be generated from a 
decode of an SR instruction. 

Note that due to the pipelined nature of the ARC, up to 

instructions could be issued following the SR which 

the ivic signal. Cache invalidates must be suppressed when 
line is being loaded from memory. This is done at the 
auxiliary register which generates ivic. 

U Program counter enable. When this signal is true, the pc 
will change at the end of the cycle, indicating that the 
memory controller needs to do a fetch on the next cycle 
using the address which will appear on currentpc[], which 
is supplied from aux_regs . vhd. 

This signal is affected by interrupt logic and all the 
other pipeline stage enables. 

U This signal, similar to pcen, indicates to the memory 
controller that a new instruction is required, and should 
be fetched from memory from the address which will be 
clocked into currentpc [25 : 2 ] at the end of the cycle. It 
is also true for one cycle when the processor has been 
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started following a reset, in order to get the ball 
rolling. 

An instruction fetch will also be issued if the host 
changes the program counter when the ARC is halted, 
provided it is not directly after a reset. 

The ifetch signal will never be set true whilst the 
memory controller is in the process of doing an 
instruction fetch, so it may be used by the memory 
controller as an acknowledgement of instruction receipt. 

out ipending U This signal is true when an instruction fetch has been 
issued, and it has not yet completed. It is not true 
directly after a reset before the ARC has started, as no 
instruction fetch will have been issued. It is used to 
hold off host writes to the program counter when the ARC 
is halted, as these accesses will trigger an instruction 
fetch. 

-- out plint U indicates that an interrupt has been detected, and an 

interrupt-op will be inserted into stage 2 on the next 
cycle, (subject to pipeline enables) setting p2int true. 

This signal will have the effect of canceling the 
instruction currently being fetched by stage 1 by causing 
p2iv to be set false at the end of the cycle when plint 
is true. 

~ out enl U Stage 2 pipeline latch control. True when an instruction 

is being latched into pipeline stage 2. Will be true 
at different times to pcen, as it allows junk instructions 
to be latched into the pipeline. 

*** A feature of this signal is that it will allow an 
instruction be clocked into stage 2 even when stage 3 
is halted, provided that stage 2 contains a killed 
instruction (i.e. p2iv = '0')- This is called a 
'catch-up 1 . *** 



Stage 2 - Operand fetch 



out en2 U Pipeline stage 2 enable. When this signal is true, the 

instruction in stage 2 can pass into stage 3 at the end 
of the cycle. When it is false, it will hold up stage 2 
and stage 1 (pcen) . 

out p2i[4:0] L Opcode word. This bus contains the instruction word 

which is being executed by stage 2. It must be qualified 
by p2iv. 

— out p2iv L Opcode valid. This signal is used to indicate that the 

opcode in pipeline stage 2 is a valid instruction. The 
instruction may not be valid if a junk instruction has 
been allowed to come into the pipeline in order to allow 
the pipeline to continue running when an instruction 
cannot be fetched by the memory controller. 

— out fsla[5:0] L Source 1 register address. This is the B field from the 

instruction word, sent to the core registers (via hostif) 
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and the LSU. It is qualified for LSU use by slen. 



-- out s2a[5:0] 



— out dest[5:0] 



out slen 



out s2en 



L Source 2 register address. This is the C field from the 
instruction word, sent to the core registers and the 
LSU. It is qualified for LSU use by s2en. 

L Destination register address. This is the A field from 
the instruction word, send to the LSU for register 
scoreboarding of loads. It is qualified by the desten 
signal . 

U This signal is used to indicate to the LSU that the 
instruction in pipeline stage 2 will use the data from 
the register specified by fsla[5:0]. If the signal is 
not true, the LSU will ignore fsla[5:0]. This signal 
includes p2iv as part of its decode. 

U This signal is used to indicate to the LSU that the 
instruction in pipeline stage 2 will use the data from 
the register specified by s2a[5:0]. If the signal is 
not true, the LSU will ignore s2a[5:0]. This signal 
includes p2iv as part of its decode. 

U From extensions. This signal is used to hold up pipeline 
stages 1 and 2 (pcen, enl and en2) when extension logic 
requires that stage 2 be held up. For example, a core 
register is being used as a window into SRAM, and the SRAM 
is not available on this cycle, as a write is taking place 
from stage 4, the writeback stage. Hence stage 2 must be 

to allow the write to complete before the load can happen. 
Stages 3 and 4 will continue running. 

U This signal is used to indicate to the LSU that the 
instruction in pipeline stage 2 will use the data from 
the register specified by dest[5:0]. If the signal is 
not true, the LSU will ignore dest[5:0]. This signal 
includes p2iv as part of its decode. 

out p2off set [19:0] L This bus carries the region of the instruction which 

contains the branch offset. It is used by the program 

counter generation logic when the instruction in stage 2 
is a Bcc/BLcc or LPcc . 



in xholdupl2 



held 



out desten 



out p2condtrue 



U This signal is produced from the result of the internal 
stage 2 condition code unit or from an extension cc unit 
(if implemented) . A bit (bit 5) in the instruction selects 
between the internal and extension cc unit results. 
As stage 2 conditionals are only used by branch and jump 
instructions, the logic to produce this signal is simpler 
than that required from p3condtrue, which takes into 
account the complications presented by short immediate 
data registers, amongst other things. When using 
p2condtrue, a decode for a branch/ jump instruction must 
always be included along with a check for p2iv = T l f . 



— out p2set flags L This is bit 8 from the instruction word at stage 2, 
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in p2int 



out p2jblcc 



out p2st 



out p21do 



out p21r 



out mload2 



out mstore2 



in holdup!2 



i.e. the . F or setflags bit used in the jump instruction. 
It is used in flags.vhd to determine whether the flags 
should be loaded by a jump instruction. The stage 3 
signal p3setflags is much more complicated, having to 
take into account the complications presented by short 
immediate data, amongst other things. 

L This signal indicates that an interrupt jump instruction 
(fantasy instruction) is currently in stage 2. This signal 
has a number of consequences throughout the system, 
causing the interrupt vector (int_vec [25 : 2] ) to be put 
into the PC, and causing the old PC to be placed into 
the pipeline in order to be stored into the appropriate 
interrupt link register. 

Note that p2int and p2iv are mutually exclusive. 

U True when a JLcc or BLcc instruction is in stage 2. 
Does not include p2iv. 

Used in conjunction with the branch delay slot mode which 
is re-created from the short immediate field. 

U This signal is used by coreregs . vhd. It is produced from 
a decode of p2i[4:0], p2iw{25) (check for SR) and 
does not include p2iv. 

U True when p2i[4:0] = oldo, and p2iw(13) = f 0 f , which 
indicates that the instruction is an LDO, not an LR which 
is an encoding of the LDO instruction. 

This signal is used by coreregs. vhd to switch short imm 
data onto a source bus when an LDO instruction is used. 
Does not include p2iv. 

U True when p2i[4:0] = oldo, and p2iw(13) = 1 1', which 
indicates that the instruction is the auxiliary register 
load instruction LR, not a memory load LDO instruction. 
This signal is used by coreregs. vhd to switch the 
currentpc bus onto the source2 bus (which is then passed 
through the same logic as the interrupt link register) 
in order to get the correct value of pc when it is read 
by an LR instruction. 
Does not include p2iv. 

U This signal indicates to the LSU that there is a valid 
load instruction in stage 2. It is produced from a decode 
of p2i[4:0], p2iw(13) (to exclude LR) and the p2iv signal. 

U This signal indicates to the actionpoint mechanism when 
selected that there is a valid store instruction in stage 
2. It is produced from a decode of p2i[4:0], p2iw(13) 
(to exclude SR) and the p2iv signal. 

U From lsu.vhd. This signal is used to hold up pipeline 
stages 1 and 2 (pcen and en2) when the load store unit 
finds a register being used by the instruction at stage 
2 which is the destination of a delayed load. It will 
also be set when the scoreboard unit is full and the 
ARC attempts to do another load. Stages 3 and 4 will 
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will continue running. 



in aluf lags [3:0] L ALU flags, direct from the latches in flags. vhd 



in x__p2noscl 



in x_p2nosc2 



out dorel 



— out dojcc 



out p2killnext 



instruction. 



U From extensions. Indicates that the register referenced 
by fsla[5:0] is not available for shortcutting. This signal 
should only be set true when the register in question is 
an extension core register. This signal is ignored unless 
constant xt_corereg is set true. 

U From extensions. Indicates that the register referenced 
by s2a[5:0] is not available for shortcutting. This signal 
should only be set true when the register in question is 
an extension core register. This signal is ignored unless 
constant xt_corereg is set true. 

U True when a relative branch (not jump) is going to happen. 
Relates to the instruction in p2 . Includes p2iv. 

U True when a jump is going to happen. 

Relates to the instruction in p2 . Includes p2iv. 

U True when the instruction in stage 2 is a branch/jump type 
operation which will kill the following delay slot 

The following operation will be marked invalid when it is 
passed from stage 1 into stage 2. 



Stage 3 



ALU 



out en3 

- out p3i[4:0] 

-- out p3a[5:0] 

— out p3c[5:0] 

out p3iv 



in p3int 



U Pipeline stage 3 enable. When this signal is true, the 
instruction in stage 3 can pass into stage 4 at the end 
of the cycle. When it is false, it will probably hold up 
stages one (pcen) , two (en2), and three. 

L Opcode word. This bus contains the instruction word 
which is being executed by stage 3. It must be 
qualified by p3iv. 

L Instruction A field. This bus carries the region of 
the instruction which contains the operand dest field. 

L Instruction C field. This bus carries the region of 
the instruction which contains the operand C field. This 
is used to encode extra single-operand functions onto 
the FLAG instruction opcode. 

L Opcode valid. This signal is used to indicate that the 
opcode in pipeline stage 3 is a valid instruction. The 
instruction may not be valid if a junk instruction has 
been allowed to come into the pipeline in order to allow 
the pipeline to continue running when an instruction 
cannot be fetched by the memory controller, or when an 
instruction has been killed. 

U This signal indicates that an interrupt jump instruction 
(fantasy instruction) is currently in stage 3. This signal 
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causes (in conjunction with p3ilevl) the appropriate 
interrupt mask bits to be cleared in the status register. 
Note that p3int and p3iv are mutually exclusive. 

in p3ilevl U This is used in conjunction with p3int to indicate which 

level of interrupt is being processed, and hence which of 
the interrupt mask bits should be cleared. 

It comes from bit 7 of the jump instruction word, which is 
set when a levell (lowest level) interrupt is being 
processed. 

— out p3condtrue U This signal is produced from the result of the internal 

stage 3 condition code unit or from an extension cc unit 
(if implemented) . A bit (bit 5) in the instruction selects 
between the internal and extension cc unit results. In 
addition, this signal is set true if the instruction is 
using short immediate data. As it is only used by 
flags.vhd in conjunction with the p3i=oflag, and 
with p3setflags, it does not include a decode for 
instructions which do not have a condition code field 
(i.e. all load and store operations). 
Does not include p3iv. 

— out p3setflags U This signal is used by regular alu-type instructions and 

the jump instruction to control whether the supplied flags 
get stored. It is produced from the set-flags bit in the 
instruction word, but if that field is not present in the 
instruction (e.g. short immediate data is being used) 
then it will either come from the set-flag modes implied 
by which short immediate data register is used, or it will 
be set false if the instruction does not affect the flags. 
Does not include p3iv. 

— out p3cc[3:0] L This bus contains the region of the instruction which 

contains the four-bit condition code field. It is sent 
with the alu flags to the extension condition code test 
logic which provides in return a signal (xp3ccmatch) 
which indicates whether it considers the condition to be 
true. The ARC decides whether to use the internal 
condition-true signal or the signal provided by extensions 
depending on the fifth bit of the instructiorl. This is 
handled within rctl.vhd. 

— in xp3ccmatch U This signal is provided by an extension condition-code 

unit which takes the condition code field from the 
instruction (at stage 3), and the alu flags (from stage 3) 
performs some operation on them and produces this 
condition true signal. Another bit in the instruction word 
indicates to the ARC whether it should use the internal 
condition-true signal or the one provided by the extension 
logic. This technique will allow extra ALU instruction 
conditions to be added which may be specific to different 
implementations of the ARC. 

— out sc regl U This signal is produced by the pipeline control unit rctl, 

~~ and is set true when an instruction in stage 3 is going to 

generate a write to the register being read by source 1 of 
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the instruction in stage 2. This is a source 1 shortcut. 

It is used by the core register module to switch the stage 3 

result bus onto the stage 2 source 1 result. 

Extension core registers can have shortcutting banned 
if x_p2noscl is set true at the appropriate time. 
Includes both p2iv and p3iv. 

The lastsl signal is sc_regl and sc_loadl ORed together. 

out sc_loadl U This signal is set true when data from a returning load is 
required to be shortcut onto the stage 2 source 1 result 
bus. This will only be the case if fast-load-returns are 
enabled, or if a four-port register file is used. If the 4p 
register file is implemented, the data used for the shortcut 
comes direct from the memory system, this requiring an 
additional input into the shortcut muxer . 

Extension core registers can have shortcutting banned 
if x_p2noscl is set true at the appropriate time. 
Includes both p2iv and p3iv. 

The lastsl signal is sc_regl and sc_loadl ORed together. 

out sc reg2 U This signal is produced by the pipeline control unit rctl, 

and is set true when an instruction in stage 3 is going to 
generate a write to the register being read by source 2 of 
the instruction in stage 2. This is a source 1 shortcut. 
It is used by the core register module to switch the stage 3 
result bus onto the stage 2 source 2 result. 

Extension core registers can have shortcutting banned 
if x_p2nosc2 is set true at the appropriate time. 
Includes both p2iv and p3iv. 

The lasts2 signal is sc_reg2 and sc_load2 ORed together. 

out sc_load2 U This signal is set true when data from a returning load is 
required to be shortcut onto the stage 2 source 2 result 
bus. This will only be the case if fast-load-returns are 
enabled, or if a four-port register file is used. If the 4p 
register file is implemented, the data used for the shortcut 
comes direct from the memory system, this requiring an 
additional input into the shortcut muxer. 

Extension core registers can have shortcutting banned 
if x_p2nosc2 is set true at the appropriate time. 
Includes both p2iv and p3iv. 

The lasts2 signal is sc_reg2 and sc_load2 ORed together. 

out p3dolink L This signal is latched (with en2) from p2dolink which is 
true when a JLcc or branch-and-link instruction was taken, 
indicating that the link register needs to be stored. It 
is used by alu.vhd to switch the program counter value 
which has been passed down the pipeline onto the p3result 
bus. If this signal is to be used to give a fully qualified 
indication that a J/BLcc is in stage 3, it must be qualified 
with p3iv to take account of pipeline tearing between 
stages 2 and 3 which could cause the instruction in stage 
three to be repeated. 

out p3dolink L This signal is latched (with en2 ) from p2dolink which is 
true when a JLcc or branch-and-link instruction was taken, 
indicating that the link register needs to be stored. It 
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out p31r 



out p3sr 



out mload 



out mstore 



out size[l:0] 



out sex 



out nocache 



out ldvalid wb 



in ldvalid 



is used by alu.vhd to switch the program counter value 
which has been passed down the pipeline onto the p3result 
bus. If this signal is to be used to give a fully qualified 
indication that a J/BLcc is in stage 3 f it must be qualified 
with p3iv to take account of pipeline tearing between 
stages 2 and 3 which could cause the instruction in stage 
three to be repeated. 

U This signal is used by hostif .vhd. It is produced from 
a decode of p3i[4:0], p3iw(13) (check for LR) and 
includes p3iv. Also used in extension logic for separate 
decoding of auxiliary accesses from host and ARC. 

U This signal is used by hostif. vhd. It is produced from 
a decode of p3i[4:0], p3iw(25) (check for SR) and 
includes p3iv. Also used in extension logic for separate 
decoding of auxiliary accesses from host and ARC. 

U This signal indicates to the LSU that there is a valid 
load instruction in stage 3. It is produced from a decode 
of p3i[4:0], p3iw(13) (to exclude LR) and the p3iv signal. 

U This signal indicates to the LSU that there is a valid 
store instruction in stage 3. It is produced from a decode 
of p3i[4:0], p3iw(25) (to exclude SR) and the p3iv signal. 

L This pair of signals are used to indicate to the LSU 
the size of the memory transaction which is being 
requested by a LD or ST instruction. It is produced 
during stage 2 and latched as the size information bits 
are encoded in different places on the LD and ST 
instructions. It must be qualified by the mload/mstore 
signals as it does not include an opcode decode. 

L This signal is used to indicate to the LSU whether 
a sign-extended load is required. It is produced during 
stage 2 and latched as the sign-extend bit in the two 
versions of the LD instruction (LDO/LDR) are in different 
places in the instruction word. 

L This signal is used to indicate to the LSU whether 

the load/store operation is required to bypass the cache. 

It comes from bit 5 of the ld/st control group which is 
found in different places in the ldo/ldr/st instructions. 

U This signal is used to control the switching of returning 
load data onto the writeback path for the register file. 
It is set true whenever returning load data must pass 
through the regular load writeback path - this will be 
loads to r32-r60 for a 4p regfile system, or loads to 
r0-r60 for a 3p regfile system. 

U From LSU. This signal is set true by the LSU to 
indicate that a delayed load writeback WILL occur on 
the next cycle. If the instruction in stage 3 wishes to 
perform a writeback, then pipeline stage 1, 2 and 3 will 
be held. If the instruction is stage 3 is invalid, or 
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in 



does not want to write a value into the core register 
set for some reason, and fast-load-returns are enabled, 
then the instructions in stages 1 and 2 will move into 
2 and 3 respectively, and the instruction that was in 
stage 3 will be replaced in stage 4 by the delayed load 
writeback. 

** Note that delayed load writebacks WILL complete, even 
if the processor is halted (en=0) . In this instance, the 
host may be held off for a cycle (hold_host) if it is 
attempting to access the core registers. ** 

regadr[5:0] U From LSU. This bus carries the address of the register 
into which the delayed load will writeback when ldvalid 
is true, rctl.vhd will ensure that this value is latched 
onto wba[5:0] at the end of a cycle when ldvalid is true, 
even cycles when the processor is halted (en = 0) . 



in mwait 



in xshimm 



in xholdupl23 



U From MC. This signal is set true by the MC in order 
to hold up stages 1, 2, and 3. It is used when the 
memory controller cannot service a request for a memory 
access which is being made by the LSU. It will be 
produced from mload, mstore and logic internal to the 
memory controller. 

U From extensions. Indicates that an extension instruction 
in stage 3 is using short-immediate data other than that 
implied by the use of one of the short-immediate data 
registers. It is used by rctl to ensure correct values for 
p3condtrue and p3setflags are generated. Qualified by 
x_idecode3, xt_aluop and p3iv (eventually) . 

U From extensions. This is used by extension ALU 
instructions to hold up the pipeline if the function 
requested cannot be completed on the current cycle. 
Pipeline stages 1, 2 and 3 will typically be held, but the 
writeback (stage 4) will continue. 

L From extensions. This signal will be true when the 
extension logic detects an extension instruction in 
stage 3. It is latched from x_idecode2 by the extensions 
when en2 is true at the end of a cycle. 

It is used to correctly generate p3condtrue, p3setflags, 
and to detect (along with xnwb) when a register writeback 
will take place. 

U From extensions. Extension instructions utilise the 
normal writeback-control logic (ins.cc's, dest=imm, 
short imm data etc), but in addition have extra functions. 
When the extension logic has 'claimed 1 an instruction 
in stage 3 by setting x_idecode3, it can also disable 
writeback for that instruction by setting xnwb. When 
x idecode3 is low, or if the instruction is 'claimed 1 by 
the ARC, xnwb has no effect. 

(x ialusel) U From extensions. This signal is provided to allow 

~ extension instructions to utilise basecase ALU operations 

for their own purposes. This is intended to be used to 



in x idecode3 



in xnwb 
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(not req. by) load up fifo command buffers for pixel engines etc. 

(rctl, here ) The ALU only decodes the bottom four bits of the 

(for info ) instruction opcode directly, and has extra logic to take 

(only. ) account of the x_ialusel signal. 

If extensions want to shadow an ALU operation, the top 
bit is set (ie an extension instruction) , whilst the 
rest of the instruction is set up as per the basecase ALU 
instruction. The ALU result mux will select an internal 
ALU result if i4 = 0 (basecase instruction) , or if 
i4=l (extension), x_ialusel=l (use int. result), 
x_idecode3~l (valid extension instruction) . The extension 
logic should also set xnwb to prevent writeback to the 
core register set. Flag setting will work normally unless 
the xsetflags signal is set, in which case the flags 
will be loaded from the xf lags [3:0] bus. 

xp2idest should be set when the instruction is in stage 2 
to prevent the scoreboard unit from checking the dest 
register field. 



Breakpoint/Actionpoint signals 



in actionhalt This signal is set true when the 

actionpoint (if selected) has been triggered by a valid 
condition. The ARC pipeline is halted and flushed when 
this signal is '1' . 

Note: The pipeline is flushed of instructions when the 
breakpoint instruction is detected, and it is important 
to disable each stage explicitly. A normal instruction in 
stage one will mean that instructions in stage two, three 
and four will be allowed to complete. However , for an 
instruction in stage one which is in the delay slot of a 
branch, loop or jump instruction means that stage two 
has to be stalled as well. Therefore, only stages three 
and four will be allowed to complete. 

— out brk_inst U To flags.vhd. This signals to the ARC that a breakpoint 

instruction has been detected in stage one of the 
pipeline. Hence, the halt bit in the flag register has to 
be updated in addition to the BH bit in the debug 
register. The pipeline is stalled when this signal is set 
to *1' . 



Note: The pipeline is flushed of instructions when the 
breakpoint instruction is detected, and it is important 
to disable each stage explicitly. A normal instruction in 
stage one will mean that instructions in stage two, three 
and four will be allowed to complete. However, for an 
instruction in stage one which is in the delay slot of a 
branch, loop or jump instruction means that stage two 
has to be stalled as well. Therefore, only stages three 
and four will be allowed to complete. 

— out AP_p3disable_r This signals to the ARC that the 

pipeline has been flushed due to a breakpoint or sleep 
instruction. If it was due to a breakpoint instruction 
the ARC is halted via the 'en 1 bit, and the AH bit is 
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set to '1' in the debug register. 



out p21imm This is used by the actionpoint 

debugging system when selected to qualify the value of 
the PC at stage one of the pipeline. The limna data is 
considered to be at the same the value address as the 
instruction it is associated with regards to the 
debugger. 



Sleep Mode signals 



in sleeping 



This is the sleep mode flag ZZ in the debug register 
(bit 23) . When it is true the ARC is stalled. This flag 
is set when the p2sleep__inst is true and 
cleared on restart or interrupt. 



out p2sleep_inst This signal is set when a sleep instruction has been 

decoded in pipeline stage 2. It is used to set the sleep 
mode flag ZZ (bit 23) in the debug register. 



Instruction Step signals 



in do_inst_step This signal is set when the single step flag (SS) and the 
instruction step flag (IS) in the debug register has been 
written to simultaneously through the host interface. It 
indicates that an instruction step is being performed. 
When the instruction step has finished this signal goes 
low. 



out stop step 



This signal is set when the instruction step has 
finished. 



ENTITY rctl IS 



PORT ( 



signal ck 
signal clr 
signal en 



in std_ulogic; 
in std_ulogic; 
in std__ulogic; 



— system clock 
system reset 
system go 



Stage 1 



signal 


pliw 


: in 


std ulogic vector (31 downto 0) ; 


signal 


ivalid 


: in 


std_ulogic; 


signal 


ivic 


: in 


std ulogic; 


signal 


pcen 


: out 


std__ulogic; 


signal 


if etch 


: out 


std_ulogic; 


signal 


ipending 


: out 


std ulogic; 


signal 


enl 


: out 


std_ulogic; 


signal 


plint 


: in 


std_ulogic; 



Stage 2 



signal en2 

signal p2i 

signal p2iv 

signal fsla 



out std_ulogic; 

out std_ulogic_vector (4 downto 0) ; 
out std_ulogic; 

out std ulogic vector (5 downto 0); 
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signal 


sza : 


out 


std ulogic vector (5 


downto 


U) ; 


signal 


dest ; 


out 


siu uiogic veoLor \ o 


Q.O WI1XO 




signal 


si en : 


out 


std_ulogic; 






signal 


s26n 


out 


std ulogic; 






s ignal 


xnoiuupi z 


in 


std_ulogic; 






signal 




out 


std_ulogic; 






s ignal 


xp2idest : 


in 


std ulogic; 






sigiidi 




m 


std_ulogic; 






signal 


p2shiimri 


out 


std_ulogic_vector (8 


downto 


0); 


signal 


pzoriset 


out 


std ulogic_vector (19 downto 0) 


signal 


p2cc : 


out 


q+'H Til on"! c vAfhnr ( ^ 


<JL\J Wll LU 




s i gnai 


xpzcciua Lcn 


m 


std__ulogic; 






signal 


pzouiiu.LX.uc t 


out 


std ulogic; 






s ignal 




out 


std_ulogic; 






signal 


pz mi^ 


m 


std ulogic; 










out 


std_ulogic; 






signal 


pZSL 


out 


std__ulogic; 






signal 


pZlQO 


out 


std ulogic; 






signal 


pzir 


out 


std_ulogic; 






signal 


mload.2 


out 


std_ulogic; 






signal 




out 


std ulogic; 






signal 


holciupl2 


in 


std_ulogic; 






signal 


aluf lags 


in 


std ulogic vector (3 


downto 


0) ; 


signal 


x p2noscl 


in 


std ulogic; 






signal 


x_p2nosc2 


in 


std_ulogic; 






signal 


dorel 


> out 


std ulogic; 






signal 


do jcc 


out 


std__ulogic; 






signal 


p2killnext 


- out 


std ulogic; 







** Stage 3 ** 



signal 


en3 : 


out 


std ulogic; 






signal 


p3i 


out 


std_ulogic_vector ( 4 


downto 


0); 


signal 


p3a 


out 


std_ulogic_vector ( 5 


downto 


0); 


signal 


p3c 


out 


std ulogic vector (5 


downto 


0); 


signal 


p3iv 


out 


std_ulogic; 






signal 


p3int 


in 


std_ulogic; 






signal 


p3ilevl 


in 


std ulogic; 






signal 


p3condtrue 


out 


std ulogic; 






signal 


p3setf lags 


out 


std_ulogic; 






signal 


p3cc 


out 


std_ulogic_vector (3 


downto 


0); 


signal 


xp3ccmatch 


, in 


std ulogic; 






signal lastsl 


out 


std ulogic; 






signal lasts2 


out 


std_ulogic; 






signal 


sc regl 


• out 


std ulogic; 






signal 


sc reg2 


. out 


std ulogic; 






signal 


sc loadl 


: out 


std_ulogic; 






signal 


sc_load2 


. out 


std_ulogic; 






signal 


p3dolink 


• out 


std ulogic; 






signal 


p31r 


: out 


std ulogic; 






signal 


p3sr 


: out 


std ulogic; 






signal 


mload 


: out 


std_ulogic; 






signal 


mstore 


: out 


std ulogic; 






signal 


size 


: out 


std ulogic__vector (1 


downto 


0); 


signal 


sex 


: out 


std_ulogic; 






signal 


nocache 


: out 


std ulogic; 






signal 


Idvalid 


: in 


std ulogic; 
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signal 


regadr 


: in 


std ulogic vector (5 downto 0); 


signal 


mwait 


: in 


std_ulogic; 


signal 


XoXlllUltl 


: in 




signal 


xholdupl23 


: in 


std_ulogic; 


signal 


x_i decode 3 


: in 


std_ulogic; 


signal 


xnwb 


: in 


std ulogic; 


signal 


p3wb en 


: out 


std ulogic; 


signal 


p3wb_nxt 


: out 


std__ulogic; 


signal 


p3wba 


: out 


std_ulogic_vector (5 downto 0) ; 


signal 


p3_ni_wbrq 


: out 


std__ulogic; 


signal 


ldvalid wb 


: out 


std ulogic; 



Debug interface 



signal 


actionhalt 


: in 


std__ulogic; 


signal 


hw brk_only 


: in 


std_ulogic; 


signal 


sleeping 


: in 


std ulogic; 


signal 


do inst step 


: in 


std ulogic; 


signal 


stop^step 


: out 


std_ulogic; 


signal 


p2sleep_inst 


: out 


std ulogic; 


signal 


brk_inst 


: out 


std_ulogic; 


signal 


p21imm 


: out 


std_ulogic; 


signal 


AP p3disable r 


: out 


std ulogic; 



signal p21iiran_data_r : out std_ulogic_vector (31 downto 0) 
signal f etch_rolling_r : in std_ulogic; 
signal p2merge_valid_r : out std_ulogic; 



END rctl; 



Mb 



ARCHITECTURE synthesis OF rctl IS 



internal signals: 



SIGNAL i_ifetch 
SIGNAL ipcen 
SIGNAL ienl 
SIGNAL ienl_lowpower 



std_ulogic; 
std_ulogic; 
std_ulogic; 
std ulogic; 



— for debugging and halting the pipeline stages 



SIGNAL i_brk_decode 
SIGNAL i_brk_inst 
SIGNAL ijorkjpass 
SIGNAL iJcill_AP 
SIGNAL i_break_stagel 
SIGNAL i_breakjstage2 
SIGNAL i_AP_p2disable_r 
SIGNAL i__AP_p3disable_r 
SIGNAL i_n_AP_p 2 disable 



std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 
std ulogic; 
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SIGNAL i_n_AP_p3disable 
SIGNAL ip2sleep_inst 
signal istop_step 
signal inst^stepping 
signal plp2step 
signal p2step 
signal p3step 
signal pcen__step 

SIGNAL ien2 
SIGNAL ip2iw 
SIGNAL ip2i 
SIGNAL ip2a 
SIGNAL ip2b 
SIGNAL ip2c 
SIGNAL ip2q 
SIGNAL ip2dd 
SIGNAL ip21d 
SIGNAL ip2_fbit 
SIGNAL ip2iv 
SIGNAL ip2ccmatch 
SIGNAL ip2condtrue 
SIGNAL ishi_bf 
SIGNAL ishijon 
SIGNAL ishi_cf 
SIGNAL ishi__cn 
SIGNAL ip2shimm 
SIGNAL ip2shimmf 
SIGNAL islen 
SIGNAL is2en 
SIGNAL idesten 
SIGNAL ip2jblcc 
SIGNAL ip2mop_e 
SIGNAL ip2size 
SIGNAL ip2sex 
SIGNAL ip2awb 
SIGNAL ip2nocache 
SIGNAL ip21do 
SIGNAL ip2st 
SIGNAL ip21imm 
SIGNAL ip2bch 
SIGNAL ip2killnext 
SIGNAL ip2pldep 
SIGNAL ip2rjmp 
SIGNAL ip2 jumping 
SIGNAL ip2nojump 
SIGNAL ip21pcc 
SIGNAL ip2dolink 
SIGNAL ilastsl 
SIGNAL ilasts2 

SIGNAL ien3 
SIGNAL ip3i 
SIGNAL ip3a 
SIGNAL ip3b 
SIGNAL ip3c 
SIGNAL ip3q 



std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 
std_ulogic; 

std_ulogic; 

std_ulogic_vector (31 downto 0) ; 

std_ulogic_jvector (4 downto 0) ; 

std_ulogic_vector (5 downto 0); 

std_^ulogic_vector (5 downto 0) ; 

std_ulogic_vector (5 downto 0) ; 

std_ulogic_vector (4 downto 0) ; 

std_ulogic_yector (1 downto 0) ; 

std_ulogic; 

std_ulogic; 

std_ulogic; 

std__ulogic; 

std_ulogic; 

std_ulogic; 

std_ulogic; 

st dialogic; 

std_ulogic; 

std_ulogic; 

std_ulogic; 

std_ulogic; 

std__ulogic; 

std_ulogic; 

std_ulogic; 

std_ulogic__vector (memop_esz downto 

std__ulogic_vector (1 downto 0); 

std_ulogic; 

std__ulogic; 

std__ulogic; 

std_ulogic; 

std__ulogic; 

std_ulogic; 

std_ulogic; 

std_ulogic; 

std_ulogic; 

std__ulogic; 

std_ulogic; 

std__ulogic; 

std_ulogic; 

std_ulogic; 

std_ulogic; 

std_ulogic; 

std_ulogic; 

std_ulogic_yector (4 downto 0); 

std_ulogic_vector (5 downto 0) ; 

std__ulogic_vector (5 downto 0); 

std__ulogic_yector (5 downto 0); 

std__ulogic_vector (4 downto 0); 
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SIGNAL 


ip3 fbit : 


std__ulogic; 








SIGNAL 


ip3shimm : 


std ulogic; 








SIGNAL 


ip3shimmf : 


std ulogic; 








SIGNAL 


ip3iv : 


std ulogic; 








SIGNAL 


ip3ccmatch : 


std_ulogic; 








SIGNAL 


ip3condtrue : 


std__ulogic; 








SIGNAL 


ip3setflags : 


std_ulogic; 








SIGNAL 


ip3size : 


std ulogic vector (1 downto 


0); 






SIGNAL 


ip3sex : 


std_ulogic; 








SIGNAL 


ip3awb : 


std ulogic; 








SIGNAL 


ip3nocache : 


std_ulogic; 








SIGNAL 


ip3wb en : 


std ulogic; 








SIGNAL 


ip3dolink : 


std_ulogic; 








SIGNAL 


ip3dimm : 


std__ulogic; 








SIGNAL 


imload3 : 


std ulogic; 








SIGNAL 


imstore3 : 


std_jalogic; 








QTrMflT 
O -L oJN.H.J_j 


J_pOI.U dWU • 










SIGNAL 


ip3ccwb_op : 


std ulogic; 








SIGNAL 


ip3xwb op : 


std ulogic; 








SIGNAL 


ip3 wb req : 


std ulogic; 










ip3 wb rSV ! 


slq ulogic^ 








SIGNAL 


ip31r : 


std_ulogic; 








SIGNAL 


ip3sr : 


std ulogic; 








SIGNAL 


ip3wba 


std ulogic vector (5 downto 


0) ; 






SIGNAL 


ip3__sc_wba 


std ulogic vector (5 downto 


0); 




2 s 


SIGNAL 


iwben 


std ulogic; 








SIGNAL 


new p2iw 


std ulogic vector (31 downto 0); 






SIGNAL 


new_p3i 


; std ulogic vector (opcods 2 downto 0 


) ; 




SIGNAL 


new p3a 


std_ulogic__vector (oprandsz 


downto 


0) ; 




SIGNAL 


new_j?3b 


std ulogic vector (oprandsz 


downto 


0); 




SIGNAL 


new p3c 


: std ulogic_vector (oprandsz 


downto 


0); 




SIGNAL 


new_jp3q 


: std ulogic_vector (qqsiz downto 0) ; 






SIGNAL 


new_j?3__fbit 


: std ulogic; 








SIGNAL 


new p3shimm 


: std ulogic; 








SIGNAL 


new p3shimmf 


: std ulogic; 








SIGNAL 


new p3size 


: std ulogic_vector (1 downto 


0); 






SIGNAL 


new_p3sex 


: std_ulogic; 








SIGNAL 


new p3awb 


: std ulogic; 








SIGNAL 


new p3nocache 


: std_ulogic; 








SIGNAL 


new p3dolink 


: std__ulogic; 








SIGNAL 


new wba 


: std_ulogic_vector (oprandsz 


downto 


0); 




SIGNAL 


iwba 


: std ulogic vector (oprandsz 


downto 


0); 




SIGNAL 


n p2iv 


: std__ulogic; 








S I GNAL 


n p3iv 


: std ulogic; 








SIGNAL 


i awake 


: std_ulogic; 








SIGNAL 


1 go 


: std_ulogic; 








SIGNAL 


n_go 


: std__ulogic; 








SIGNAL 


ni go 


: std ulogic; 








SIGNAL 


i hostload 


: std ulogic; 








SIGNAL 


ip3wb nxt 


: std ulogic; 








SIGNAL 


ip21imml 


; std_ulogic; 








SIGNAL 


ip21imm2 


: std_ulogic; 
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SIGNAL 


ien3 non iv 


: std_ulogic; 


SIGNAL 


ien2 non iv 


: std ulogic; 


SIGNAL 


ip3 load_stall 


: std__ulogic; 


SIGNAL 


isc_regl 


: std_ulogic; 


SIGNAL 


isc reg2 


: std ulogic; 


SIGNAL 


isc loadl 


: std ulogic; 


SIGNAL 


isc load2 


: std_ulogic; 


SIGNAL 


ihp2 Id nscl 


: std ulogic; 


SIGNAL 


ihp2 Id nsc2 


: std ulogic; 


SIGNAL 


ihp2 Id nsc 


: std ulogic; 


SIGNAL 


ibch holdp2 


: std ulogic; 


SIGNAL 


ibch p3flagset 


: std_ulogic; 


SIGNAL 


ildvalid wb 


: std__ulogic; 


signal 


ip2ivalid r : std 


ulogic; 


signal 


ip21imm data_r : 


std_ulogic_vector (31 downto 0) 


signal 


i p2merge valid r 


: std ulogic; 


signal 


i fst if etch r : 


std ulogic; 


signal 


ijp2 f st if etch_r 


: std_ulogic; 


signal 


i_fetchen : std_ulogic; 


signal 


i_pending_kill_r 


: std_ulogic; 


signal 


i cancel kill r : 


std ulogic; 


signal 


i_ifetch_r : std__ulogic; 


signal 


impending : std__ulogic; 


signal 


i pi used__r : std_ulogic; 



BEGIN 

— New Outputs 

p21imm_data_r <= ip21imni_data_r ; 
p2merge_valid_r <= i_p2merge_valid_r; 
ipending <= i_ipending; 
** stage 1 ** 



merge_process : process (ck, clr) 
begin 

if clr = 1 1' then 

ip2ivalid_r <= T 0 T ; 

i_p2merge_valid_r <= '0'; — PS 

ip21imm_data_r <= (others => r 0 c ); 
i_fst_ifetch_r <= ' l ! ; 
i_p2_f st_if etch_r <= ' l f ; 
i_pending_kill_r <= '0 T ; 
i__cancel_kill_r <= 1 0 1 ; 
i_pl_used_r <= ' 0 1 ; 

elsif (ck 1 EVENT and ck = r l f ) then 
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Latch ivalid for use in stage 2 
ip2ivalid_r <= ivalid; 

— Latch in long immediates when an instruction in stage 2 
references a long immediate and its available in stage 1 
Record that the long immediate is available and has been 

— merged with the opcode. 

— Indicate that the dataword in stage 1 has been used. 

if ivalid = '1 ! and ip21imm = ! 1 T then 

ip21imm_data_r <= pliw; 

i_pl_used_r <= '1 T ; 
i_p2merge_valid_r <= ' 1 ! ; 
end if; 

— Indicate that the dataword in stage 1 has been used. 

if ienl_lowpower = '1' then 

i_pl_used_r <= 1 1 ! ; 
end if; 



O — When a new instruction dataword is requested clear the pl_ 

=fl used flag 

|4 if i_ifetch = T l' then 

i_pl_used_r <= ' 0 ! ; 
end if; 

y 

— When the instruction in stage 2 moves and it references a Ion 



— immediate clear the p2merge valid flag. 

if ien2 = 'I 1 and ip21imm = '1' then 
i_p2merge_valid_r <= ' 0 1 ; 
end if; 

— Start up the pipeline so stage 1 can advance stage 0 when 

— stage 0 is stalled or an ivic is requested. 

if ivic = 1 1 T or i_ifetch = 1 0 ! then 
i_fst_ifetch_r <= '1'; 
i__p2__fst_ifetch_r <= '1 T ; 

end if; 



— Clear the ifetch advancement flag 

if i_ifetch = T l' then 

i_fst_ifetch_r <= ' 0'; 
end if; 

— Clear the ifetch advancement flag 

if i_fst_ifetch_r = 1 0 f and ivic = ' 0 1 then 

i_p2_fst_ifetch_r <= T 0'; 
end if; 



— Re-intialize to ifetch advancement flags 

— since ifetching has been stalled 



-54- 



if ivic = f l T or i_ifetch = t 0 l then 
i_fst_if etcher <= T l'; 
i_p2_fst_ifetch_r <= '1'; 

end if; 



if ivalid = r 0' and ip2kiilnext = '1' and ien2 = 1 0' then 

i_cancel_kill_r <= 1 1 r ; 
end if; 

if ien2 - ' 1' then 

i_cancel_Jcill_r <= ' 0 ! ; 
end if; 



if ivalid = r 0' and ip2killnext = '1' and ien2 - ' 1 T 
and i_cancel_kill_r = r 0 T then 
i_pending_kill_r <= ' 1'; 
end if; 

Clear the pending instruction kill flag when the instruction 
— is killed 

if i_pending_kill_r = f l f and ivalid = ' l f then 

i__pending_kill_r <= '0' ; 
end if; 

— Not used . . . 
i ifetch r <- i ifetch; 



end if; 
end process; 



** Stage 1 logic ** — 

The breakpoint instruction is determined at stage 1 from: 
[1] Decode of pliw, 

[2] Instruction at stage 1 is valid, 

[3] The instruction is not killed, 

[4] The instruction is not long immediate data, 

[5] There is no sleep instruction in stage 2. 

brk_decode <= T l f WHEN (pliw (instrubnd downto instrlbnd) = oflag) 

AND (pliw (copubnd downto coplbnd) = so_brk) 
AND (pliw(shimmlbnd) = ! 0 ! ) 
AND (ip2ivalid_r = '1') else 

'0'; 

i_brkjpass <= NOT (ip2killnext ) AND 
NOT(ip21imm) AND 
NOT (ip2sleep_inst) ; 

i brk inst <= ibrk decode AND i_brk_j?ass; 
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brk_inst <= f O r ; — i_brk_inst; 
** Stage 2 **- 



The sleep instruction is determined at stage 2 from: 
[1] Decode of p2iw, 

[2] Instruction at stage 2 is valid. 



ip2sleep_inst <= ? 1 T WHEN (ip2iw (instrubnd downto instrlbnd) = of lag) 

AND (ip2iw (copubnd downto coplbnd) = so_sleep) 
AND (ip2iw(shimmlbnd) = '1') 
AND (ip2iv = T l') ELSE 

'0 T ; 

p2sleep_inst <= ip2sleep_jLnst ; 



— The data to be used for input to stage 2 is latched here. 

— Clock the instruction presented by the memory controller when ienl is 

true. - Also clock p2iv, which is ivalid clocked when ienl is true. 
This signal is used to indicate which instructions in the pipeline are 
real and which are junk which is being allowed to flow through to keep 

— things running. 

p2ins: pipe32 PORT MAP (ck, ienl_lowpower , clr, pliw, ip2iw) ; 

new_p2iw <= pliw WHEN ienl_lowpower = T 1 T ELSE ip2iw; 

p2ins : PROCESS {ck, clr) 
BEGIN 

IF clr - '1' THEN 

ip2iw <= (others => ! 0 ! ); 

ELSIF (ck 1 EVENT AND ck = '1') THEN 

if ienl_lowpower = ! 1 ! and ip21imm = ! 0 ! then 

end if; 
ip2iw <= new_p2iw; 
END IF; 
END PROCESS; 



The various component parts of the instruction are extracted here to 
internal signals. 

ip2i <= ip2iw ( instrubnd downto instrlbnd); — opcode 

ip2a <= ip2iw (aopubnd downto aoplbnd) ; -- a field 

ip2b <= ip2iw (bopubnd downto boplbnd) ; — b field 
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ip2c <= ip2iw (copubnd downto coplbnd) ; 

ip2_fbit <= ip2iw(setflgpos) ; 

ip2q <= ip2iw(qqubnd downto qqlbnd) ; 

ip2dd <= ip2iw(ddubnd downto ddlbnd) ; 



c field 
flag bit 
q field 

delay slot mode 



— Output drives of signals direct from the stage 2 input latch, 
(some more extraction takes place also) 



p2i <= ip2i; 

dest <= ip2a; 

fsla <- ip2b; 

s2a <= ip2c; 

p2shimm <= ip2iw (shimmubnd downto shimmlbnd) 

p2cc <= ip2iw(ccubnd downto cclbnd) ; 

p2offset <= ip2iw (targubnd downto targlbnd) ; 



— opcode 

— destination 

— source 1 

— source 2 
short immediate 

— CC field (no x bit) 
branch offset 



— Now some simple decodes from the opcode field are performed. 

— These are for files which do their own decode of the p2i[] field. 

ip2jblcc <=> f l' WHEN (ip2i = oblcc) — branch and link 

OR ((ip2i = ojcc) AND (ip2c(0) = 1 1 1 ) ) — jump and link. 
ELSE '0'; 

— output drives — 

p2jblcc <= ip2jblcc; 

ip2st <= 'l f WHEN (ip2i - ost) AND ip2iw(25) = f 0 T ELSE T 0'; — ST 
instruction, 

p2st <= ip2st; 

mstore2 <= ip2st AND ip2iv; 

— The load instruction has two opcodes ldr (00) and Ido (01) . The aux LR 

instruction is encoded on the Ido instruction, so must be excluded when 

— producing a signal which indicates that a load instruction is in stage 2. 



ip21d 



<= 



' l f WHEN (ip2i = oldr) 

OR (ip2i = oldo AND ip2iw(13) 

f 0' ; 



r 0 T ) 



ELSE 



mload2 
p21r < 
ip21do < 
p21do 



<= 



ip21d AND ip2iv; 

r l T WHEN (ip2i - oldo) and ip2iw(13) = '1 ! ELSE '0'; 
'1 T WHEN (ip2i = oldo) and ip2iw{13) = J Q' ELSE 'O 1 ; 
ip21do; 



output drives 

slen <= islen; 
s2en <= is2en; 
desten <= idesten; 
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Output drive — 
p2condtrue <= ip2condtrue; 



Stage 2 flag setting calculation — 

p2setflags just comes from bit 8 of the instruction word. It is only 
used in flags.vhd and is qualified there with a decode of p2i=ojcc, 
and a check for p2iv and p2condtrue. 

p2set flags <= ip2_fbit; 



Produce signals to pass down the pipeline which indicate to stage 3 
which register fields include immediate data registers, qualified with 
the slen/s2en/desten signals. 

Here four signals are produced, one for each of the combinations of 
the two source fields and the two short immediate data registers 
(i.e. set flags/don't set flags) 



ishi_bf <= 

ishi_bn <= 

ishi_cf <= 

ishi cn <= 



'1' WHEN ( ip2b 

' 1 T WHEN { ip2b 

'l 1 WHEN ( ip2c 

'1' WHEN { ip2c 



rfshimm ) and islen 

rnshimm ) and islen 

rfshimm ) and is2en 

rnshimm ) and is2en 



! 1' ELSE *0' 

'1* ELSE '0' 

'1' ELSE '0' 

ELSE ! 0' 



Now produce signals which indicate whether a short-imm field is present 
at the bottom of the instruction, due to a register (ip2shimm) , 
and indicate whether the flags should be set or not (ip2shimmf ) . 



ip2shimm <= ishijof OR ishi^bn OR ishi_cf OR ishi_cn; 
ip2shimmf <= ishi_bf OR ishi_cf; 



Now extract the extra encoding information used for loads and stores. 
The signals are extracted and latched at the end of stage 2. 



ip2mop_e 



ip2nocache <= 

ip2size <= 

ip2sex <= 

ip2awb <- 



ip2iw(ldo_eubnd downto ldo_elbnd) WHEN ip21do « r l* ELSE 

ip2iw(st_eubnd downto st__elbnd) WHEN ip2st = '1' ELSE 

ip2iw (ldr_eubnd downto ldr_elbnd) ; 

ip2mop_e (ls_nc) ; 

ip2mop_e (ls_subnd downto ls_slbnd) ; 
ip2mop_e (ls_ext) ; 
ip2mop_e (ls_awbck) ; 



Generate signals for pipeline control and interrupt control units — 

p21imm - this will be true when a valid instruction which uses long imm 
data is in stage 2. Note that this signal will include p2iv as it includes 
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— slen/s2en. 

— p2bch - this will be true when a jump instruction bcc/blcc/lpcc/j cc is 

— in stage 2. This also includes p2iv, but explicitly this time. 

— p2pldep - this signal is used to indicate that the instruction at stage 2 

— requires that the next instruction be in stage 1 before it can move off. 

This may be either to ensure correct delay slot operation for a branch 
or to make sure that long immediate data is fetched, and then killed 

— before it can be processed as an instruction. 

ip21imml <= '1' WHEN ip2b = rlimm ELSE '0'; 
ip21imm2 <= '1' WHEN ip2c = rlimm ELSE 'O'; 



ip21imm 
ip2bch 



ELSE 



ip2pldep 

p21imm 

p2pldep 



<= (ip21imml AND islen) OR (ip21imm2 AND is2en) ; 



<= 'l 1 WHEN ip2iv 



' 1* AND (ip2i = obcc OR ip2i = oblcc 
OR ip2i = olpcc OR ip2i = ojcc ) 



'0'; 

<== ip21imm OR ip2bch; 
<= ip21imm; 
<= ip2pldep; 



Pipeline control unit 



i valid U From memory controller. Indicates that the instruction/data 

word presented to the ARC on pliw[31:0] is valid. 

— plint U Indicates that an interrupt has been detected, and an 

interrupt-op will be inserted into stage 2 on the next 
cycle, setting p2int true. This signal will have the 
effect of canceling the instruction currently being 
fetched by stage 1 by causing p2iv to be set false at the 
end of the cycle when plint is true. 



— p2int L Indicates that an interrupt-op instruction is in 

stage 2. This signal is used in coreregs.vhd to control 
the placing of the pc onto a source bus for writing back 
to the interrupt link registers, and by aux_regs to 
insert the interrupt vector int_vec [] into the program 
counter, thus requiring this file to set pcen true. 

p2bch U This signal indicates that the instruction in stage 2 is 

a branch or jump instruction, and therefore requires that 
the instruction following must be present in the delay slot 
before it can move on. 

(Simple decode of p2i[4:Q], and does include p2iv) 

p21imm U This signal indicates that the instruction in stage 2 uses 

long immediate data for one of the source operands. This means 
that the instruction cannot complete until the correct data 
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word has been fetched into stage 1. When the instruction does 
move out of stage 2, the data word is marked as an invalid 
instruction before it gets into stage 2. The data word has 
served its purpose by this point, so it can be overwritten 
by another instruction if stage 3 is stalled, and stage 1 is 
allowed to move on into stage 2 over the top of the data word. 
This signal includes slen/s2en and p2iv. 

holdupl2 U From lsu.vhd. This signal is used to hold up pipeline 

stages 1 and 2 (pcen, enl and en2) when the load store unit 
finds a register being used by the instruction at stage 
2 which is the destination of a delayed load. It will 
also be set when the scoreboard unit is full and the 
ARC attempts to do another load. Stages 3 and 4 will 
will continue running. 

xholdupl2 U From extensions. This signal is used to hold up pipeline 
stages 1 and 2 (pcen, enl and en2) when extension logic 
requires that stage 2 be held up. For example, a core register 
is being used as a window into SRAM, and the SRAM is not 
available on this cycle, as a write is taking place from 
stage 4, the writeback stage. Hence stage 2 must be held to 
allow the write to complete before the load can happen. 
Stages 3 and 4 will continue running. 

p2killnext U This signal indicates that the delay slot mechanism of the 
jump instruction currently in stage 2 is requesting that the 
next instruction be killed before it gets into stage 2, 
This signal is produced from a decode for a jump instruction 
code, the condition-true signal, p2iv and the delay-slot field 
in the instruction. This signal relies on the delay slot 
instruction being present in stage 1 before stage 2 can move 
on. This is handled elsewhere by this file. 



ldvalid U From LSU. This signal is set true by the LSU to 

indicate that a delayed load writeback WILL occur on 
the next cycle. If the instruction in stage 3 wishes to 
perform a writeback, then pipeline stage 1, 2 and 3 will 
be held. If the instruction is stage 3 is invalid, or 
does not want to write a value into the core register 
set for some reason, then the instructions in stages 1 
and 2 will move into 2 and 3 respectively, and the 
instruction that was in stage 3 will be replaced in 
stage 4 by the delayed load writeback. 

** Note that delayed load writebacks WILL complete, even 
if the processor is halted (en=0) . In this instance, the 
host may be held off for a cycle (hold_host) if it is 
attempting to access the core registers. ** 

mwait U From MC. This signal is set true by the MC in order 

to hold up stages 1, 2, and 3. It is used when the 
memory controller cannot service a request for a memory 
access which is being made by the LSU. It will be 
produced from mload3, mstore3 and logic internal to the 
memory controller. 

mload3 U This signal indicates to the LSU that there is a valid 
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xholciupl23 



p3_wb_req 



load instruction in stage 3. It is produced from a decode 
of p3i[4:0], p3iw(13) (to exclude LR) and the p3iv signal. 

It is used here to ensure that a lockup situation cannot 
occur when a branch is holding up stages 1, 2 and 3. 

U From extensions. This is used by extension ALU 
instructions to hold up the pipeline if the function 
requested cannot be completed on the current cycle. 
Pipeline stages 1, 2 and 3 will be held, but the 
writeback (stage 4) will continue. 

U This signal (produced by rctl.vhd) is set true when the 
instruction in stage 3 wants to writeback to the register 
file, i.e. - 

a. A destination register is given (r0-r60) 

b. A link register to be be written (interrupt, BLcc) 

c. LD/ST with .A specified - to do address writeback 



It will be false when no destination is requried, 
a. jumps /branches (not BLcc) 



i.e. 



b. instructions with dest = immediate 

c. instructions for which the condition is false 

d. LD/ST without .A specified - no address writeback 

e. cancelled instructions (p3iv = 1 0 f ) 

f. extension instruction, xnwb = ' l 1 



p3_wb_rsv U This signal is set true when the instruction at stage 3 
wants to reserve the writeback stage for itself. This is 
required when a FIFO-type instruction wants to suppress 
writeback to the register file, but needs the data and 
register address to be present in the writeback stage so that 
it can be picked off and sent into the FIFO buffer. 

Is it generated by rctl.vhd and will be true when an 
extension instruction at stage 3 is suppressing writeback 
with the xnwb signal. 

cr_hostw U This signal is set true to indicate that a host write 
to the core registers will take place on the next cycle, 
and that the end-of-stage 3 data and register address latches 
should clock in the address and data provided by the host. 
*** Note that host writes are overridden by returning delayed 
loads. This signal hold_host will be asserted (produced in 
rctl.vhd) to tell the host to wait for a cycle. *** 

h_pcwr U From pcounter . vhd. This signal is set true when the host 

is attempting to write to the pc/status register, and the 
ARC is stopped. It is used to trigger an instruction fetch 
when the PC is written when the ARC is stopped. This is 
necessary to ensure the correct instruction is executed 
when the ARC is restarted. 



— Outputs: 



enl 



U Stage 2 pipeline latch control. True when an instruction 
is being latched into pipeline stage 2. Will be true 
at different times to pcen, as it allows junk instructions 
to be latched into the pipeline. 
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*** A feature of this signal is that it will allow an 
instruco be clocked into stage 2 even when stage 3 
is halted, provided that stage 2 contains a killed instruction 
(i.e. p2iv = 1 0 f ). This is called a 'catch-up 1 . *** 

U This signal, similar to pcen, indicates to the memory 
controller that a new instruction is required, and should 
be fetched from memory from the address which will be clocked 
into currentpc [25 : 2] at the end of the cycle. It is also 
true for one cycle when the processor has been started 
following a reset, in order to get the ball rolling. 

An instruction fetch will also be issued if the host changes 
the program counter when the ARC is halted, provided it is 
not directly after a reset. 

The ifetch signal will never be set true whilst the memory 
controller is in the process of doing an instruction fetch, 
so it may be used by the memory controller as an 
acknowledgement of instruction receipt. 

U This signal is true when an instruction fetch has been 
issued, and it has not yet completed. It is not true directly 
after a reset before the ARC has started, as no instruction 
fetch will have been issued. It is used to hold off host 
writes to the program counter when the ARC is halted, as 
these accesses will trigger an instruction fetch. 

This signal is true when the pc is allowed to change state. 
It takes account of ivalid (stage 1 has fetched a valid 
instruction) and the interrupts which need to be able to 
prevent the pc from updating. 

*** A feature of this signal is that it will allow an 
instruction to be clocked into stage 2 even when stage 3 
is halted, provided that stage 2 contains a killed instruction 
(i.e. p2iv = ' 0 1 ). This is called a 1 catch-up'. *** 

U Stage 3 pipeline latch control. Controls transition of 
instruction in stage 2 to stage 3. Will be set false if the 
op in stage 2 requires data from stage 1 which is not 
forthcoming because the instruction cannot be fetched to 
stage 1 during this cycle (i.e. ivalid = 1 0 1 ). This condition 
will occur for instructions which use long immediate data 
or for jump/branch instructions which require the correct 
instruction to be in the delay slot. 

U Stage 3 instruction completion control. This signal is set 
true to indicate that the instruction in stage 3 can complete 
at the end of the cycle and pass out of pipeline stage 3. It 
may or may not pass into stage 4 (the writeback stage), 
depending on whether a writeback is required or not . Taken 
on its own, this signal controls writeback to the flags. 

U Stage 4 pipeline latch control. Controls transition of the 
data on the p3result [31 : 0] bus, and the corresponding register 
address from stage 3 to stage 4. As these buses carry data 
not only from instructions but from delayed load writebacks 
and host writes, they must be controlled separately from the 
instruction in stage 3. This is because if the instruction in 
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stage 3 does not need to write a value back into a register, 
and a delayed load writeback is about to happen, the 
instruction is allowed to complete (i.e. set flags) whilst 
the data from the load is clocked into stage 4. If however 
the instruction in stage 3 DOES need to writeback to the 
register file when a delayed load writeback is about to 
happen, then the instruction in stage 3 must be held up 
and not allowed to change the processor state, whilst the 
data from the delayed load is clocked into stage 4 from 
stage 3. 

*** Note that p3wb_en can be true even when the processor 
is halted, as delayed load writebacks and host writes use 
this signal in order to access the core registers. *** 

wben L This signal is the stage 4 write enable signal. It is 

latched from p3wb_en. Stage 4 is never held up. 

p2iv L Pipeline stage 2 instruction valid. This latched signal 

indicates that the instruction in stage 2 is valid. When it 
— is set false, the instruction is stage 2 is either a junk 

value clocked in to keep the pipeline running, or an 
instruction which was killed by the interrupt system. 

p3iv L Pipeline stage 3 instruction valid. This latched signal 

indicates that the instruction in stage 3 is valid. When it 
is set false, the instruction is stage 3 is either a junk 
value clocked in to keep the pipeline running, or an 
instruction which was killed by the interrupt system, or 
a blank slot inserted when the instruction in stage 2 was 
not allowed to complete on the previous cycle. This blank 
slot must be inserted otherwise the instruction which was 
executed by stage 3 during the previous cycle will be executed 
again during the current cycle. 

pcen : Program counter update enable 

This signal indicates to the program counter that a new value can be 
loaded. This will be the case when: 

a. A valid instruction has been fetched and can be passed on to 
stage 2, allowing the memory controller to start looking for the 
next instruction to be executed. 

*** Note that this logic handles the case when stage 2 contains 
an invalid instruction which is held due to stall in stage 3, and 
we allow the instruction in stage 1 to move into stage 2, 

b. An interrupt is in stage 2, and the interrupt vector is to be 
clocked into the program counter. The instruction now being 
fetched into stage 1 will be killed anyway, but we must wait 
until it has been fetched to be sure that we do not issue a new 
fetch request to the memory controller before the last one has 
completed. 

The interrupt vector should only be clocked into the program 
counter when the interrupt can move out of stage 2. This will 
ensure that the correct pc value will be placed in the interrupt 
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link register. 

-- We will also want to forcibly prevent the program counter from being 
updated in some cases: 

a. An interrupt has been recognized, and we want to kill the 
instruction currently in stage 1, and not increment the program 
counter in order to ensure the correct PC value is stored into 
the appropriate interrupt link register. 

b. The breakpoint instruction (or valid actionpoint) is detected and 
the pipeline is to be flushed, and then halted. 

c. A single instruction step is being executed, whilst preventing 
another ifetch from being generated in order to only execute one 
instruction at a time. During a single instruction step the PC is 
only allowed to be updated and (thereby generating a new ifetch) 

— when : 

1. a valid instruction in stage 1 is allowed to pass into 
stage 2. 

2. a branch or jump instruction is in stage 2 has a killed 
delay slot. 

3. an instruction is in stage 2 that uses a long immediate. 

4. an interrupt has been detected and is now in stage 2. 



*** Note that if an invalid instruction in stage 2 is held (this will be 
due to a stall at stage 3) then the instruction in stage 1 will be 
allowed to move into stage 2. *** 

— Added to allow pc updates to advance ifetching 

— ip2ivalid_r prevents the core advancing more than 1 cycle 

— ij?2_f st__ifetch_r = ' 1' and i_fst_if etcher = '0 1 allow the core to 

— initially advance ifetch to get thing 'rolling' 



ipcen <= '0 T WHEN en - '0 1 

OR (ip2ivalid_r - '0' and not (i_p2_f st_if etch_r = '1' and 
i_fst_ifetch_r = ' 0 T )) 

— or (ip21imm = '1' and i_p2merge_valid_r = f l') 
OR (p2int = ' 1 T AND ien2_non_iv = ' 0 1 ) 
OR (ip2iv = T l' AND ien2_non_iv = '0') 
OR (i_break_stagel = T l') 

or (ip21imm = r l' and ip2killnext = 'I 1 and 

ijp2merge__valid_jr = ' 0 ' ) 

OR inst_stepping = '1' 

OR plint - ? 1' ELSE 

'1'; 

inst_stepping : PC Disable for single instruction step 



— The signal inst_stepping prevents the PC from being updated, by disabling 

— the PC enable signal (pcen) . The signal is set when a single instruction 
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step is being performed and the PC does not need to be updated 
(pcen_step = ' 0 ' ) . 



inst_stepping <= '1' WHEN do__inst_step = l l ! 

AND pcen_step = 1 0 ' ELSE 

•0'; 

The signal pcen_step is set when a single instruction step is being 
executed if the PC needs to be updated. This happens in the following 
cases : 

a. A valid instruction in stage one is allowed to pass into 
stage 2 . 

b. A branch or jump in stage 2 has a killed delay slot. 

c. An instruction using long immediate is in stage 2. 

d. An interrupt has been detected and is now in stage 2. 



pcen_step <- T l f WHEN (do_inst_step = '1' 
AND p2step = '0') 

OR (p2step - ' l 1 
AND (ip2killnext = ! 1' 

OR ip21imm = T l' ) ) 

OR p2int = 1 l f 

ELSE 

T 0'; 

stop_step : stop single instruction step when finished 

The stop_step signal is related to single instruction step. When the 
single instruction has been completed the stop_step signal goes high. 
Depending on the type of instruction the stop is made in different 
places in the pipeline: 

a. Branches and jumps with delay slots that are not killed stop in 
stage 2, because the instruction in the delay slot count as a new 
instruction. Next instruction step will execute the branch 

and the delay slot. 

b. All other instructions complete in stage 3 (if writeback is not 
performed) or stage 4 (if writeback is performed) . 

When the stop_step signal goes high the ARC is halted, 

the step tracker signals (below) are reset and a new instruction fetch 
is generated. 

istop_step <= 'l 1 WHEN (ip2bch = 1 1' 

AND ip21imm - '0 ? 
AND ip2killnext = T 0' 
AND p2step = '1' ) 
OR (p3step - '1 T 
AND ip3wb_en - 1 0 1 ) 
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ELSE 



stop_step <= istop__step; 

step tracker : keeps track on the single step instruction ~ 

The step_tracker process keeps track on where in the pipeline the 
instruction is during single instruction step. It generates three 
tracking signals: plp2step, p2step and p3step. The signal p2step 
is high when the instruction is in pipestage 2 and p3step is high 
when the instruction is in pipestage3. As you see in the timing 
diagram below p2step and p3step stays high after being set until 
the cycle after the stop signal stop_step is issued, which means 
that the instruction has completed. 

Here is an example how the step tracker process works for an 
instruction with writeback and no long immediate. The pipeline 
is clean before the step starts. 



— ck / \ / \ / \ / \ / \ 



do_inst_step / \ 



ienl / \ 

ivalid / 

plp2step / 

p2step / 

p3step / \ 

ip3wb_en / \ 

stop_step / \ 



The signal plp2step is set when a valid instruction has moved from 
stage 1 to stage2. This signal sets p2step. But p2step is not only 
set by plp2step but also if there is already an instruction in 
stage 2 that uses long immediate or has a killed delay slot or if 
an interrupt is in stage 2 (p2int is set) . This can happen if the 
ARC was just halted after running in free-running. The pipeline 
can then be filled with anything in this situation. This can only 
happen on the first instruction step after free-running mode. On 
the second consecutive instruction step the pipeline will be clean. 

p2step <= plp2step OR 

(do_inst_step AND (ip21imm OR ip2killnext OR p2int) ) ; 

step_tracker: PROCESS (ck, clr) 

BEGIN 

IF clr = 1 1 1 THEN 



-66- 



plp2step <= ' 0' ; 
p3step <= ! 0 T ; 
ELS IF (ck f EVENT AND ck = 1 1 T ) THEN 

IF istop_step = ' l 1 THEN 

plp2step <= 1 0'; 
ELSIF (ienl = '1 T AND (ivalid = 1 1 1 OR plint - 1 1 1 )) THEN 

plp2step <= do_inst_step; 
END IF; 

IF istop_step = 'l 1 THEN 

p3step <= T 0' ; 
ELSIF ien2 - '1' THEN 

p3step <= p2step; 
END IF; 

END IF; 

END PROCESS; 



— A load of signals inserted to reduce the complexity of the logic 
minimization task for the ivalid signal 

ien2 noniv <= T 0 T WHEN en = ' 0 ! 

OR ien3_non_jlv = ' 0 ' 
OR (holdupl2 OR ihp2_ld_nsc) - f l T 
OR xholdupl2 = '1' 
OR ibch_holdp2 = '1' ELSE 

'1'; 

r 0 r WHEN en = f 0' 

OR (xholdupl23 AND xt_aluop) = T l 1 
OR mwait = T 1 1 
OR ip3_load_stall = '1' 

T l'; 

— ifetch : Tell M/C to do a fetch 

This signal is used to tell the memory controller to do another 
instruction fetch with the program counter value which will appear at 
the end of the cycle. It is normally the same as pcen except for when 

— the processor is restarted after a reset, when an initial instruction 

fetch request must be issued to start the ball rolling. 

In addition, ifetch will be set true when the host is allowed to change 
the program counter when the ARC is halted. This will means that the new 

— program counter value will be passed out to the memory controller 
correctly. The ifetch signal is not set true when there is an instruction 

— fetch still pending. 

— Signal i_awake will be true for one cycle after the processor is started 

— after a reset. 

i_awake <= en AND NOT l_go; 

— Signal i_hostload will be true when an new instruction fetch needs to be 
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ien3 non iv <~ 



ELSE 



- issued due to the host changing the program counter. 



i_hostload <= ' 1 T WHEN h_pcwr = ! 1' 

AND ip2ivalid__r = '1' AND n_go = ' 1 ? else — 

T (V; 



i_fetchen <= '0' WHEN en = 1 0 1 

OR (ip2ivalid_r = '0' 

and not (i_p2_f st_if etch_r = 1 1 1 

and i_fst_if etcher = ' 0 1 )) 
OR (p2int - T l f AND ien2_non_iv = ' 0 1 ) 
OR (ip2iv = 1 1' AND ien2_non_iv - 1 0') 
OR (i_break_stagel = 'I 1 ) 
OR inst_stepping = 1 1 1 

OR plint = 1 1' ELSE 
'1'; 



The ifetch signal comes from either pcen, kick-start after reset, or 
— when a fetch is required as the host has changed the program counter. 



i_ifetch <- i_fetchen OR i_awake OR i_hostload; 
— ARC3 i_i fetch <= pcen OR i_awake OR i_hostload; 

— The latch is set true after the processor is started after a reset, and 

— will stay true until the next reset. 

— l__go is taken low when the instruction cache is invalidated 

This is in order to prevent a lockup situation 



n_go <= en OR l__go; 

ni_go <= n_go AND not ivic; 

lego: PROCESS (ck, clr) 



BEGIN 



IF clr = '1' THEN 

l_go <= '0 ! ; 
ELSIF (ck 1 EVENT AND ck = '1') THEN 

l__go <= ni_go; 
END IF; 



END PROCESS; 



ipending : An instruction is being fetched 

— This signal is set true when an instruction fetched has been issued, 

— (i.e. not directly after reset) and the fetch has not yet completed, ■ 

— signaled by ivalid = ' 0 T . 

— It is used to prevent writes to the pc from the host from generating 

— an ifetch request when there is already an instruction fetch pending. 

— Host accesses are rejected with hold_host, generated in hostif.vhd 
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ipend : process (ck, clr) 
begin 

IF clr = T 1 T THEN impending <= f 0'; 
ELSIF ck - ! 1' AND ck' event THEN 

entry state : when ARC is started onwards 
IF i_ifetch = f l ! THEN impending <= T l'; 
END IF; 

— entry state : when ARC is started onwards 

IF i_ifetch = f l ! THEN impending <= ' 1 T ; 
END IF; 

— exit state : i.e. when no more fetches are required 

— Or: An instruction cache invalidate puts us back into 

the immediately post-reset condition. 

IF (i_i fetch = '0 T AND ivalid - '1') 

OR (ivic = '1') THEN ipending <= T 0'; 

END IF; 



— Pending ifetchs in the core are cancelled when an ivalid 
— returns from the cache or an invalidate is requested 

IF (--i_i fetch = '0' AND 
ivalid = f 1' ) 

OR (ivic = 'I 1 ) THEN impending <= ' 0 T ; 

END IF; 

END IF; 
END PROCESS; 



enl : Pipeline 1 -> 2 transition enable 



-- This signal is true at all times when the processor is running except 

— when: 

--a. A valid instruction in stage 2 cannot complete for some reason, or 

if an interrupt in stage 2 is waiting for a pending instruction fetch 
to complete. 

— b. A breakpoint instruction (or valid actionpoint) is detected and 

stage 2 has to be halted, while the remaining stages are flushed, and 
then halted. 

c. The single instruction has already moved on to stage 2 and this 
instruction does not depend on the following instruction. 
This is a special case that only happens during single instruction 
step. Because single instruction step finishes the instruction that 
was in pipeline stage 1, this is actually the starting mechanism of the 
single instruction stepping. The next instruction is not allowed 
to pass on until the instruction in further down the pipe has 
completed and not until a new single instruction step command 
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has been generated. 

*** N 0 te that if an invalid instruction in stage 2 is held (this will be 
due to a stall at stage 3) then the instruction in stage 1 will be 
allowed to move into stage 2. *** 

— An additional disable flag is added to cope with ifetch stalling and 
— the cache keeping ivalid high even though the instruction in stage 1 
— has moved to stage 2 or further. 

ienl <= '0 T WHEN en - ' 0 1 

OR (p2step = 1 1 1 AND ip2pldep = T 0' AND p2int = '0 T ) 

OR (i_break_stagel - '1') 

OR (p2int = f l' AND ien2 = r 0 r ) 

OR (ip2iv - '1 T AND ien2 = ' 0 T ) 

or i_pl_used_r = 1 l 1 

or (i_if etcher = 1 0' and impending = T 0 T ) 
or (impending = T 0 ' and i_awake = T 0 T ) 
ELSE 

T l\* 

— The signal ienl_lowpower (below) is almost always equal to ienl (above), 
except 

— when the opcode is not valid. This prevents invalid opcodes to propagate to 

— pipeline stage 2 and thereby power is saved. 

-- This is ONLY used to enable the p2iw latch. The global EN1 stays as normal. 

— The ivalid signal is also used in sync_regs to switch off RAM reads when the 
new instruction is not valid. 



ienl_lowpower <= '0' WHEN (ivalid - f 0 T ) ELSE 

ienl ; 



en2 : Pipeline 2 -> 3 transition enable 

This signal is true when the processor is running, and the instruction 

in stage 2 can be allowed to move on into stage 3. It may be held up for 
a number of reasons: 



a. A register referenced by the instruction is currently the subject 
of a pending delayed load, (holdup!2 from the scoreboard unit) . 

b. Stage 2 contains an instruction which requires a long immediate 
data value from stage 1 which cannot be fetched on this cycle. 
(ip21imm - '1' ) 

c. Stage 2 contains a jump/branch instruction, which require that the 
correct instruction be present in the delay slot following the 
jump/branch instruction. 

( ip2bch = '1', ivalid - '0') 

d. An interrupt in stage 2 is waiting for a pending instruction 
fetch to complete before issuing the fetch from the interrupt 
vector. 



e. A valid instruction in stage 3 is held up for some reason. 



-70- 



- Note that stage 3 will never be held up if it does not contain 
a valid instruction. 



f . Extensions require that stage 2 be held up, probably due to a 
register not being available for a read on this cycle. 

g. The branch protection system detects that an instruction setting 
flags is in stage 3, and a dependent branch is in stage 2. Stage 2 
is held until the instruction in stage 3 has completed. 

h. The opcode is not valid (ip2iv = 1 0 T ) and this is not due to an 
interrupt (i.e p2int = T 0'). This is done to reduce power 

consumption . 

i. The actionpoint debug mechanism or the breakpoint instruction 

is triggered and thus disables the instructions from 
going into stage 3 when the instruction in stage 1 is the delay slot 
of a branch/ jump instruction. 

j . A branch/jump with a delay slot that is not killed is in stage 2 
during single instruction step. 

— All ivalid have been changed to p2ivalid_r so the processor doesn't 

— get more than one cycle ahead 

— Additionally instructions in stage 2 referencing a limm can only move 

— after the limm is merged. 

— Also when stage 2 is stalled when pi has be used . . . 

ien2 <= ' 0' WHEN en = f 0 ! 

OR ien3 = '0' 

OR (holdupl2 OR ihp2_ld_nsc) - ' 1' 
OR xholdupl2 = '1' 

or (i_p2merge_valid_r = ! 0' and 
p21irnm = 1 1 ! ) 

OR (p2int = 'l 1 AND ip2ivalid_r = '0') 
OR (ip2bch = '1' AND ip2ivalid_r = T 0 1 ) 
OR (ip21imm - 'l 1 AND ip2ivalid_r = 1 0 1 ) 
OR ibch_holdp2 = ' 1 ' 
OR (ip2iv = '0' AND p2int = '0') 
OR (iJoreak_stage2 - ' 1') 
or i_pl_used_r = '1' 
OR (plp2step = f l ! AND ip2bch = f l r 
AND ip21imm = T 0' AND ip2killnext = 1 0') 

ELSE 

! 1'; 



ibch holdp2 : Branch protection system 



In order to reduce code size, we want to remove the need to have a NOP 

between setting the flags and taking the associated branch. 

e.g. sub.f 0,r0,23 ; is r0=23? 

nop ; padding instruction. << 

bz rO is 23 ; 
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— In order that the compiler does not have to generate these instructions, 

— we can generate a stage 2 stall if an instruction in stage 3 is attempting 

— to set the flags. Once this instruction has completed, and has passed out 

— of stage 3, then stage 2 will continue. 

— We need to detect the following types of valid instruction at stage 3: 

i. Any ALU instruction which sets the flags (p3setflags) 

— ii. Jcc.F or JLcc.F 

— iii. A FLAG instruction. 

ibch_p3flagset <= ip3iv WHEN (ip3setflags = ' 1 T ) 

ALU OR ((ip3i = ojcc) AND (ip3_fbit = 'l 1 )) 

Jcc/ JLcc . ^ , , _ T „ 

OR ((ip3i = oflag) AND (ip3c = so_f lag) ) ELSE - 

FLAG 

'0 ! ; 

— In order to generate the stall, we also need to detect a valid branch 
instruction 

— present in stage 2 (ip2bch) . 

— We generate a stall when the two conditions are present together: 

— a. An instruction in stage 3 is attempting to set the flags 

— b. A branch instruction at stage 2 needs to use these new flags 

— Note that it would be possible to detect the following conditions to give 

— theoretical improvements in performance. These are very marginal, and have 

— been left out here for the sake of simplicity, and the fact it would be 

— difficult for the compiler to take advantage of these optimizations. 

— Both cases remove the link between setting the flags and the following 

— branch, either because the flags don't get set, or because the branch doesn 

— check the flags. 

— i. Conditional flag set instruction at stage 3 does not set flags 

e.g. add.cc.f rO,rO,rO, resulting in C=l 

-- ii. Branch at stage 2 uses the AL (always) condition code. 

ibchJioldp2 <= '1' WHEN (ibchj?3flagset = '1 ! ) " P3 setting 

flagS AND (ip2bch = '1') ELSE — branch in p 



en3 : Stage 3 instruction completion control 



— This signal is true when the processor is running, and the instruction 

- in stage 3 can be allowed complete and set the flags if appropriate, 
stage 3 may be prevented from completing for a number of reasons: 

a An extension multi-cycle ALU operation has requested extra time 
to complete the operation (xholdopl23) . Note that this can only 
be the case when extension alu operations are enabled with the 
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xt_aluop constant in extutil . vhd. 



b. The memory controller is busy and cannot accept any more load or 
store operations, {mwait) 

c. Deleted in v6. 



p2iv : Stage 2 instruction valid 

This signal indicates that stage 2 contains a valid instruction. The 
instruction in stage 2 may not be valid for a number of reasons: 

a. A breakpoint/actionpoint has been detected, and instructions in 
stage two are to be invalidated for when the ARC is to be 
restarted. 

b. The correct instruction word could not be fetched in time, so 
a junk instruction is inserted into the pipeline to keep it 
flowing, 

c. An interrupt was recognized, causing the instruction which was in 
stage 1 (valid or not) to be killed. 

d. The interrupt which was recognized, and which is now in stage 2, 
requires a blank delay slot to perform the jump to the interrupt 
vector. The instruction in stage 1 is therefore killed. 

e. A long immediate data value was required by the previous 
instruction, and is killed to prevent it being executed as a 
real instruction. 



f . The delay slot mechanism of a jump/branch instruction in stage 2 
has decided that the following instruction should be killed. 
Note that this instruction must be present in stage 1 in order 
to be killed, before the pipeline can be moved on. This is 
handled by the en2 signal. 

g. The single instruction in stage 2 will move on to stage 3 the 
next cycle. This is a special case which only occurs during 

a single instruction step. This must be done to avoid the 
instruction from being executed repeatedly in stage 2. The 
reason this does not kill instructions with long immediates 
or delay slots is because of the signal ien2. The signal ien2 
is not set when there is an instruction in stage 2 that uses a 
long immediate or delay slot in stage 1 in this situation. The 
reason is that stage 2 stalls while another fetch is being done 
in order to get the LIMM/delay slot. 

The appropriate value is latched into p2iv when the instruction in stage 1 

— is allowed to move into stage 2. 

— Jumps can move independently of delay slot instructions in this case 

— delays which need to be killed are killed by pending kill logic 
( i_pending_kill_r ) 

n_p2iv <= '0' WHEN ( (i_break_stagel = '1') AND 

(ijbreak_stage2 = '0') AND ien2 = T l ! ) OR 
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(p2step = ' 1 T AND ien2 - T l ? ) ELSE 
ip2iv WHEN ienl = T 0' ELSE 
1 0 ? WHEN (plint OR p2int) = '1' 

OR ip21imm = 1 1 T 

OR ip2killnext = 'l 1 ELSE 



OR ip2killnext = '1' 

or i_pending_kill_r - 'l 1 



ELSE 

ivalid; 



p2ivreg : PROCESS (ck, clr) 



BEGIN 



IF clr = '1' THEN 

ip2iv <= ! 0 f ; 
ELSIF (ck 1 EVENT AND ck = ' 1' ) THEN 

ip2iv <= n_p2iv; 
END IF; 



END PROCESS; 

p3iv : Stage 3 instruction valid 

This signal indicates that stage 3 contains a valid instruction. The 
— instruction in stage 3 may not be valid for a number of reasons: 



a. The instruction was marked as invalid when it moved into stage 2, 
i.e. p2iv - ' 0' . 

b. The instruction in stage 2 has not been able to complete for some 
reason, and the instruction in stage 3 has been able to complete 
and will move on at the end of the cycle. It is thus necessary 

to insert a blank slot into stage 3 to fill in the gap. If this 
is not done, the instruction which was in stage 3 will be 
executed again, and this would of course be *bad news*. 



n_p3iv <= ip3iv WHEN ien3 - ' 0 ! ELSE 
'0' WHEN ien2 = '0 f AND ien3 = r l' ELSE 

ip2iv; 



p3ivreg : PROCESS (ck f clr) 
BEGIN 

IF clr = THEN 

ip3iv <= ' 0' ; 
ELSIF (ck 1 EVENT AND ck = ' 1 T ) THEN 

ip3iv <= n_jp3iv; 
END IF; 

END PROCESS; 
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Disable Logic to Stall the ARC 

for Actionpoint System 

— The pipeline flushing mechanism has been introduced to support the 
breakpoint instruction and actionpoint hardware. Each stage of the 

-- pipeline is stalled explicitly, and once all stages one, two and three 

— have been stalled the ARC is stalled via en bit 



— This signal is true when both of the the following conditions are true: 

a. The instruction in stage one should be killed when it advances 
into stage two. 

b. The actionpoint mechanism was set by the hardware breakpoint 
alone. 

i_kill_AP <= ip2killnext AND hw_brk_only; 

— The stalling signal for stalling enl is defined by i__break_stagel, and 

— this is set to 'I 1 on the following conditions: 

a. The breakpoint instruction has been detected at stage one, i.e. 
i_brk_inst = ! l ! or an actionpoint has been triggered by a valid 
signal from the OR-plane. 

b. The instruction in stage one of the pipeline is to be executed, 
and not killed. 



c. The sleep instruction has been detected in stage 2. 

d. The ARC is sleeping already (sleeping = '1') due to a sleep 
instruction that was encountered earlier. 



i_break_stagel <= f l f WHEN i_brk_inst = '1' 

OR ip2sleep_inst = f l' 
OR sleeping = '1* 
OR (actionhalt = 1 1' 

AND i_kill_AP = f 0 r ) ELSE 

'0'; 



— The stalling signal for stalling en2 is defined by i_break_stage2, and 
this is set to T l 1 on the following conditions: 

a. A breakpoint /actionpoint instruction has been detected at stage 
one, i.e. i_brk_inst = 'l 1 . For example, an actionpoint has 
been triggered by a valid signal from the OR-plane, 
This has to true when there is an instruction 
in stage one is in the delay slot of a branch, jump or loop 
instruction. It can also be long immediate data. 



b. A breakpoint /actionpoint instruction has been detected at stage 
one, i.e. i__break_stagel = '1'. For example, an actionpoint has 
been triggered by a valid signal from the OR-plane. 

This has to true when there is an interrupt in 
stage two, i.e. p2int = f l f 
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i_break_stage2 <= ' l 1 WHEN ( (ip2pldep - ' l 1 OR p2int = ' 1') 

AND (i_break_stagel = '1')) ELSE 

'0'; 

— As the pipeline is flushed of instructions when the breakpoint instruction 

— or a valid actionpoint is detected it is important to disable each stage 

— explicity. These signals have to follow the last instruction which is being 

— allowed to complete. A normal instruction in stage one will mean that 

— instructions in stage two, three and four will be allowed to complete. 
However, 

— for an instruction in stage one which is in the delay slot of a branch, loop 

— or jump instruction means that stage two has to be stalled as well. 
Theref ore, 

— only stages three and four will be allowed to complete. 

— The qualifying valid signal for stage two is defined by i_n_AP_p 2 disable, 

— and this is set to ! 1 T on the following conditions: 

a. There is an instruction in stage two which has a dependency in 
stage one, i.e. i_break_stage2 = '1'. 

b. The breakpoint instruction or actionpoint has been detected, i.e. 
i_break_stagel = 1 1' and the instruction in stage two is enabled, 
ien2 = '1', and the instruction is allowed to move on. 

c. The breakpoint instruction or actionpoint has been detected, i.e. 
i_break_stagel = 1 1 1 and the instruction in stage two is invalid, 
ip2iv = ' 0' . 



i_n_AP_p2di sable <- '1' WHEN iJoreak_stage2 - T l' OR 

(i__break_stagel = 1 1 1 AND 
( (ien2 - 1 1' AND ip2iv = ' 1') OR 
ip2iv = '0' ) ) 

ELSE 

'0'; 



The qualifying valid signal for stage three is defined by i_n_AP_jp 3 disable, 
and this is set to r l f on the following conditions: 

a. The instruction in stage two is invalid, i_AP_p2disable_r = '1' . 
Also the instruction in stage three is enabled, en3 = T l', and 
the instruction is allowed to move on. 

b. The instruction in stage two is invalid, i_AP_p2disable_r = ' 1 ! . 
Also the instruction in stage three is invalid, ip3iv = '0 1 . 

i_n_AP_p3disable <= '0 1 WHEN i__break_stagel = f 0 T ELSE 

' 1 ' WHEN £ i_AP_p2disable_r = 1 1 ' ) AND 

( (ien3 = '1' AND ip3iv = '1') OR 
ip3iv = ! 0' ) ELSE 

<0'; 



update_AP_disable : PROCESS (ck, clr) 



BEGIN 
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I F clr = 1 1 1 THEN 

i_AP_j?2disable_r <= 1 0 r ; 

i_AP_p3disable_r <= T 0'; 
ELSIF (ck * EVENT AND ck = '1') THEN 

i AP_p2disable__r <= i_n_AP_p2 disable; 

i~AP_p3disable_r <= i_n_AP_p 3 disable; 
END IF; 

END PROCESS; 

— Output Dives for halting the ARC 

AP_p3disable__r <= i_APjp3disable_r; 



END synthesis; 




-11- 



